A COMPARISON OF LEAST-SQUARES AND BAYESIAN FITTING TECHNIQUES TO RADIAL VELOCITY DATA SETS


A COMPARISON OF LEAST-SQUARES AND BAYESIAN FITTING TECHNIQUES TO RADIAL VELOCITY DATA SETS

A thesis submitted to the faculty of San Francisco State University in partial fulfillment of the requirements for the degree Master of Science in Physics

by Peter Driscoll
San Francisco, California
July, 2006

CERTIFICATION OF APPROVAL

I certify that I have read A Comparison of Least-Squares and Bayesian Fitting Techniques to Radial Velocity Data Sets by Peter Driscoll, and that in my opinion this work meets the criteria for approving a thesis submitted in partial fulfillment of the requirements for the degree Master of Science in Physics at San Francisco State University.

Debra Fischer, Professor of Physics, Department of Physics & Astronomy, San Francisco State University
Ron Marzke, Professor of Physics, Department of Physics & Astronomy, San Francisco State University
Eric Ford, Miller Research Fellow, Department of Astronomy, University of California, Berkeley

A COMPARISON OF LEAST-SQUARES AND BAYESIAN FITTING TECHNIQUES TO RADIAL VELOCITY DATA SETS

Peter Driscoll
San Francisco State University
2006

An integral part of quantifying the orbits of extrasolar planets is fitting Keplerian orbits to radial velocity data. We focus on two specific techniques for fitting orbits to radial velocity data: a Bayesian Markov Chain Monte Carlo (MCMC) technique and a frequentist Levenberg-Marquardt Monte Carlo (LMMC) technique. Our goal is to apply these two techniques to a range of synthetic radial velocity sets and present a comparison of the performance of each. The MCMC is designed to yield the full posterior probability distribution for each orbital parameter. The LMMC identifies the best-fit orbital parameters of a given data set. We identify which technique is best suited to fit to partial orbits, sparse observations, relatively large intrinsic variation, large orbital eccentricities, and low signal-to-noise synthetic data sets. We find that each method is accurate when applied to data sets containing more than a full orbital period. It is possible to constrain some orbital parameters from partial period or sparsely sampled data sets, but we identify the regimes that challenge either one or both of these methods.

I certify that the Abstract is a correct representation of the content of this thesis.

- Chair, Thesis Committee / Date

ACKNOWLEDGMENTS

I would like to thank my thesis mentor Debra Fischer and other members of the California and Carnegie Planet Search Team at San Francisco State University, and Eric Ford at the University of California at Berkeley, for their valuable conversations, contributions, and encouragement.

TABLE OF CONTENTS

List of Tables
List of Figures
Introduction
Kepler's Equations of Orbital Motion
Orbit Fitting Methods for Radial Velocity Data
  Levenberg-Marquardt Monte Carlo Method
  Markov Chain Monte Carlo Method
    Transition Probability Function
    Convergence Test
Testing Fitting Techniques with Synthetic Radial Velocity Observations
  Synthetic Data Sets
  MCMC Simulations
  LMMC Simulations
  Class 1: Varying the Observational Parameters
    Worst Cases: Data Sets c1, c2, and c3
  Class 2: Varying the Orbital Eccentricity
  Class 3: Synthetic Observations of Jupiter
Discussion
  Strengths and Weaknesses of LMMC and MCMC
Conclusion
References

LIST OF TABLES

1. Class 1: Varying the Observational Parameters
2. Class 2: Varying the Orbital Eccentricity
3. Class 3: Synthetic Observations of Jupiter

LIST OF FIGURES

1. A Typical Orbit
2. Acceptance Rates
3. Class 1 MCMC Contour Histograms
4. MCMC and LMMC Period Median and Peak
5. MCMC and LMMC Semi-Amplitude Median and Peak
6. MCMC and LMMC Eccentricity Median and Peak
7. MCMC and LMMC Argument of Periastron Median and Peak
8. MCMC and LMMC Time of Periastron Passage Median and Peak
9. MCMC and LMMC Amplitude Off-set Median and Peak
10. MCMC and LMMC Chi Values for Peak Parameter Values
11. Worst Case Data Set Radial Velocities and Parameter Histograms
12. Class 2 MCMC and LMMC Eccentricity Histograms and Contours
13. Class 3 Radial Velocities and Parameter Histograms

1 Introduction

Over 170 extrasolar planets have been discovered to date. Most have been detected from the Doppler shift of their host stars. If the orbital plane of the star-planet system is inclined with respect to the plane of the sky then the wobble of the host star about the center of mass can induce an observable Doppler shift in its stellar spectrum. The Doppler shift of stellar spectral lines is an effect of the radial component of the motion of the star. If spectroscopic precision is sufficient then a time series of radial velocity can be modeled with a Keplerian orbit and the properties of the orbiting planet can be inferred. Once the orbital properties of a planet have been confidently derived from the radial velocity data, the discovery of a new planet is announced. Once published, the orbital parameters of extrasolar planets are used in a variety of applications to study theoretical properties of planets and planetary systems. Correctly estimating the orbital parameters of newly discovered planets and their associated uncertainties is therefore critical to our understanding of planets in general. In this paper we discuss the application and compare the performance of two radial velocity fitting techniques. The first is Levenberg-Marquardt, a least-squares frequentist approach with Monte Carlo bootstrapping. The second is Markov-chain Monte Carlo, a Bayesian approach. We test these techniques on three classes of synthetic radial velocity data. The first class includes 57 synthetic radial velocity sets, each generated from the same single-planet Keplerian model of a planet orbiting a solar-mass star, but varying in phase coverage, density of observations, intrinsic stellar jitter, and time of first observation. The second class includes 21 synthetic

data sets varying in orbital eccentricity from 0.0 to 0.6. The third class includes 4 synthetic observations of a Jupiter analog with varying phase coverage and density of observations. We then summarize and discuss the performance of each fitting technique.

2 Kepler's Equations of Orbital Motion

We begin by introducing the physical model to be used by both techniques and the mathematical equations needed to describe the orbit of a typical planet. Planets orbit their primary star according to Kepler's equations, assuming there are no additional planet-planet interactions. Kepler's laws of orbital motion give us the necessary relationships to describe the orbital parameters and physical properties, such as mass and semi-major axis, of the planet. Since the orbital inclination with respect to our line of sight cannot be resolved by the radial velocity technique alone, only the mass of the planet multiplied by sin(i) can be calculated. The radial velocity wobble of a star can be described in terms of Keplerian orbital parameters by

v(t) = K[cos(ν(t) + ω) + e cos ω] + γ    (1)

where K is the semi-amplitude, the true anomaly ν(t) is the angle describing the position of the star in time in the plane of the orbiting planet, e is the eccentricity of the planet, and ω is the argument of periastron with respect to the line of nodes (Taff 1985). The parameter γ is a radial velocity offset applied to all data points.
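As a concrete illustration, Equation 1 can be evaluated for a set of observation times with a short Python sketch (illustrative only, not part of the thesis); it uses the anomaly relations of Equations 3-5 below, refining the eccentric anomaly by Newton-Raphson iteration, and all function and variable names here are ours.

```python
import numpy as np

def true_anomaly(t, P, e, tp, tol=1e-10, max_iter=100):
    """True anomaly nu(t) for an eccentric orbit (Eqs. 3-5)."""
    # Mean anomaly, wrapped onto [0, 2*pi) (Eq. 5)
    frac = (t - tp) / P
    M = 2.0 * np.pi * (frac - np.floor(frac))
    # Starting guess for the eccentric anomaly (Eq. 4), refined by Newton-Raphson
    E = M + e * np.sin(M) / (1.0 - np.sin(M + e) + np.sin(M))
    for _ in range(max_iter):
        dE = (E - e * np.sin(E) - M) / (1.0 - e * np.cos(E))
        E -= dE
        if np.all(np.abs(dE) < tol):
            break
    # True anomaly (Eq. 3)
    return 2.0 * np.arctan(np.sqrt((1.0 + e) / (1.0 - e)) * np.tan(E / 2.0))

def radial_velocity(t, P, K, e, omega, tp, gamma):
    """Keplerian radial velocity model of Eq. 1 (omega in radians)."""
    nu = true_anomaly(t, P, e, tp)
    return K * (np.cos(nu + omega) + e * np.cos(omega)) + gamma
```

Note that ω must be supplied in radians in this sketch, while the thesis quotes it in degrees.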

The semi-amplitude K (m/s) is defined by

K = (2π/P) a M_planet sin i / [(M_star + M_planet) √(1 − e²)]    (2)

where P (days) is the period of the planet, a (AU) is the semi-major axis, and i is the inclination of the orbital plane with respect to the plane of the sky (Taff 1985). Figure 1 shows the orientation of a typical orbit. As shown, the line of nodes is the intersection between the orbital plane and the plane of the sky, or reference plane. The periastron is the point in the orbit closest to the center of mass, or focus, of an elliptical orbit. The rotation of the orbit in the plane of the sky is described by the longitude of the ascending node Ω, defined with respect to the zero point of longitude. This angle, measured in the plane of the sky, is not resolved by the radial velocity technique.

Figure 1: The first figure shows a typical orbit inclined by an angle i with respect to the plane of the sky, or reference plane. The ascending node is the intersection of the orbital plane and the plane of the sky. The argument of periastron ω is the angle from the ascending node to the line of apsides in the orbital plane, which contains the perihelion and aphelion. The longitude of the ascending node Ω describes the rotation of the orbit in the plane of the sky. The second figure shows an ellipse (bold curve) and a circle (thin curve). The center of the circle is labeled c, the focus of the ellipse is labeled s, and the location of the orbiting body is labeled p. The true anomaly ν is labeled T, and the mean anomaly M and eccentric anomaly E are also labeled. The point z labels the point of perihelion. (Figures taken from scienceworld.wolfram.com/physics/orbit and en.wikipedia.org/wiki/true anomaly.)

To describe an eccentric orbit in terms of times of observation three angular quantities are needed: ν(t) the true anomaly, E(t) the eccentric anomaly, and

M(t) the mean anomaly. The true angular location of the planet around the focus, ν(t), is defined with respect to the line of apsides. M(t) is an angle measured from the center of the ellipse, not the focus, and points towards the mean location of the planet in time on a circle of radius a. M(t) can be interpreted as the average angular location of the planet given an average angular velocity, or mean motion n. E(t) is also an angle measured from the center of the ellipse to a point on a circle of radius a. E(t) describes the point on the circle that is intersected by a line normal to the line of apsides, which also contains the position of the planet on the ellipse. The true anomaly is defined by

ν(t) = 2 tan⁻¹[ √((1 + e)/(1 − e)) tan(E_1(t)/2) ]    (3)

where E_1(t) is an estimate of the eccentric anomaly. The eccentric anomaly can be estimated by

E_1(t) = M(t) + e sin M(t) / [1 − sin(M(t) + e) + sin M(t)]    (4)

and further refined using Newton-Raphson iteration (Taff 1985). The mean anomaly M(t) is defined in terms of the time of periastron passage t_p by

M(t) = 2π [ (t − t_p)/P − int((t − t_p)/P) ]    (5)

For a circular orbit, e = 0 and ν(t) = M(t) = E(t) for all t. For an eccentric orbit, in the time between periastron and apastron ν(t) ≥ E(t) ≥ M(t). In the time

between apastron and periastron these relations are reversed. Typically six orbital parameters, such as (P, K, e, ω, t_p, γ), are free parameters when fitting a Keplerian model to radial velocity data. Due to the non-linearity of these equations, solving for the best-fit or most likely orbital parameters given a set of observation times t and amplitudes v(t) must be an iterative process.

3 Orbit Fitting Methods for Radial Velocity Data

The statistical techniques commonly applied to non-linear parameter estimation are either frequentist or Bayesian. The primary difference between these approaches is that the frequentist method requires integration over the distribution of possible data sets, or hypothetical realizations of the actual data, whereas the Bayesian method requires integration over the parameter probability distributions (Loredo 1999, Robert 1999). The frequentist method yields an ensemble of best-fit parameter values by sampling from the distribution of possible observations. The Bayesian method yields the parameter probability distributions directly. The frequentist approach posits an optimization problem: if the maximum likelihood estimator (MLE), such as the χ² statistic, is optimized then the parameters of maximum likelihood have been identified. A minimization of the χ² statistic is also known as a least-squares approach, and is interpreted as having found the best-fit parameters. This approach is reliable for fits with a small number of parameters, N ≤ 4. When applied to models where N > 4 the least-squares approach, as an approximation to the maximum likelihood, becomes

computationally more challenging (Loredo 1999). Typically a numerical method, such as Levenberg-Marquardt, is used to optimize the MLE, while a simulation method, such as bootstrapping, is used to estimate the range of the most likely parameters. The Bayesian approach is primarily concerned with computing probabilities for a set of model parameters, known as posterior probability distributions. Inferences regarding the probability for a parameter value to fall within a certain range of values are easily extracted from these probability distributions using common statistical measures. There is no single best-fit set of parameters, but for relatively narrow probability distributions the parameters corresponding to the peak are considered most probable. The Bayesian technique has a natural way of including a priori knowledge about the physical model in the form of prior probability distributions. There is no analogous feature in the bootstrap method to include prior constraints from the physical model. The Bayesian paradigm also allows one to compare the probability of multiple models given a set of observations, for example the probability of a data set corresponding to a two-planet model compared to a single-planet model, by calculating the Bayes factor (Gregory 2005, Robert 1999).

3.1 Levenberg-Marquardt Monte Carlo Method

A common approach to solving Kepler's equation for radial velocity data, and thereby estimating the orbital parameters of an unseen companion, is the Levenberg-Marquardt (LM) least-squares fitting routine (Press 1992). The LM algorithm uses a maximum likelihood approach to answer the question: how likely is the

data set given a set of orbital parameters? This routine minimizes the χ² maximum likelihood estimator by using a gradient search through parameter space. For a given radial velocity data set, where v_obs(t_k) and σ_k are the observed velocities and error measurements of the k-th observation, χ² is defined as

χ² = Σ_k [v_obs(t_k) − v_model(t_k)]² / σ_k²    (6)

where v_model(t) is a function of the orbital parameters. By minimizing χ² the LM routine asymptotically approaches the maximum likelihood estimate of the orbital parameters. The least-squares approach to orbit fitting is efficient for well-sampled data sets with good phase coverage, corresponding to a dominant global minimum in χ²-space. The concern with least-squares approaches is that for data sets with less than optimal sampling or poor phase coverage, corresponding to many local minima in χ²-space, the LM routine is sensitive to initial guesses and is easily trapped. There is no accurate way to estimate the uncertainty in the fitted parameters with the LM routine alone, so we use the common method of bootstrapping, or residual refitting, to provide a Gaussian approximation of the parameter distribution about the best-fit value (Press 1992). The details of a single LM fit are as follows. After choosing initial parameter values, the parameters are perturbed iteratively until a minimum in χ² is found. At each iteration the LM algorithm calculates the current value, χ²_current. Parameter perturbations are then calculated as a function of the gradient of χ²_current and a step size coefficient λ. Next, χ²_new of the new parameters is calculated. If χ²_new < χ²_current then the new parameters are accepted and λ is increased so that the

magnitude of the next parameter perturbation is larger. If χ²_new > χ²_current then the new parameters are rejected, λ is decreased, and a new set of parameter perturbations is calculated. The LM routine continuously switches between a linear and a parabolic stepping scheme. In regions where χ² changes drastically a steepest-descent method is used, and in regions where χ² changes more gradually an inverse-Hessian method is used (Press 1992). We choose to stop fitting when

|χ²_new − χ²_current| / χ²_current < 10^-6    (7)

which indicates that the minimum has been found. This is considered a single LM fit. An LM fit is guaranteed to find at least one local orbital solution, but for some data sets there can be distinct local minima in χ²-space that provide equally good fits. In an attempt to avoid getting trapped in a local minimum we start with various initial period guesses. Periodicities in the data are identified initially by a periodogram (Horne & Baliunas 1986). The period with the highest power is then used as an initial guess in an LM fit. If the peak period power is split between two periods then both periods are used as initial guesses in two separate LM fits, and the period that gives the lower χ² fit is adopted as the initial period guess. Once the best period guess is determined, one additional LM fit using this initial period is calculated to obtain the best-fit LM solution. The LM routine only returns a single set of best-fit parameters. Estimating the distribution of the fit parameters requires Monte Carlo residual fitting, also

known as the bootstrap method (Press 1992). The bootstrap method involves refitting to new realizations of the radial velocity data, created by adding the scrambled residuals of the best-fit solution back onto the best-fit model. The residuals of the best-fit solution are the differences between each observed radial velocity data point and each model data point, the latter being a function of the best-fit parameters. This is called LM Monte Carlo (LMMC) fitting. The LMMC routine runs thousands of fits, each time scrambling the residuals randomly amongst the dates of observation and using the LM routine to fit the newly created radial velocity data. In the end the LMMC routine returns distributions of parameter values that represent the best-fit parameters of a sample of hypothetical observations. The standard deviation of each parameter distribution gives the one-sigma uncertainty for that fit parameter. The precision of the parameter distributions is calculated every few thousand steps and the routine is halted once sub-sets of each parameter array agree to 1%. This is based on two statistical measures: (1) the standard deviation of the mean of each parameter sub-set, and (2) the mean of the standard deviation of each parameter sub-set. We demand that the distributions agree to 1%, which is attained for each parameter when each of the two statistical measures is less than 1% of the mean and standard deviation, respectively.
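The LMMC procedure can be summarized with a short Python sketch (illustrative only, not the thesis code). It assumes a radial_velocity model like the sketch in Section 2 and, for concreteness, uses SciPy's Levenberg-Marquardt solver in place of the thesis's own LM implementation; the function names are ours.

```python
import numpy as np
from scipy.optimize import least_squares

def lm_fit(t, v, sigma, x0):
    """Single LM fit: minimize chi^2 (Eq. 6) starting from parameters x0."""
    def weighted_resid(x):
        P, K, e, omega, tp, gamma = x
        return (v - radial_velocity(t, P, K, e, omega, tp, gamma)) / sigma
    return least_squares(weighted_resid, x0, method="lm").x

def lmmc(t, v, sigma, x_best, n_boot=1000, rng=None):
    """Bootstrap (residual refitting) around the best-fit solution x_best."""
    rng = rng or np.random.default_rng()
    v_model = radial_velocity(t, *x_best)
    resid = v - v_model
    samples = []
    for _ in range(n_boot):
        # Scramble the residuals among the dates of observation and refit
        v_new = v_model + rng.permutation(resid)
        samples.append(lm_fit(t, v_new, sigma, x_best))
    return np.array(samples)   # distributions of best-fit parameters
```

The one-sigma uncertainty of each orbital parameter is then the standard deviation of the corresponding column of the returned array.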

3.2 Markov-chain Monte Carlo Method

The Markov-chain Monte Carlo (MCMC) method similarly calculates parameter distributions given a radial velocity data set, but MCMC samples directly from the parameter posterior probability distributions. MCMC is a Bayesian technique, where the goal is to calculate the probability distributions for a set of parameters given a set of observations and a physical model. In our case the observations are radial velocity observations and the single-planet Keplerian model defines the orbital parameters and physical properties of the planet. A Markov chain is a series (i.e., chain) of parameter fits (i.e., states) such that each state depends only on the previous state. A convergent Markov chain is interpreted as the posterior probability distribution for each parameter, and must be allowed to sample all regions of parameter space with non-zero probability. A Markov chain is guaranteed to converge to the posterior distribution if it satisfies three properties: aperiodicity, irreducibility, and reversibility. Aperiodicity requires that the chain not contain any recurrent periodic loops between any two states. Irreducibility requires that the chain have a non-zero probability of getting to any state from any other state. Reversibility requires that the probability be non-zero for the chain to return to any previous state (Gilks 1996). Reversibility allows the target distribution to be stationary. As will be shown, these chains explore parameter space more fully than LMMC and often report broader parameter distributions. The probability distribution, or posterior probability distribution, of the orbital parameters given the data set is denoted by p(x|d), where x = [log P, log K, e, ω, t_p, γ] and d is the radial velocity data. We have chosen

to search period and semi-amplitude space logarithmically so as not to exclude values an order of magnitude or more from the starting point of the chain. Each d in our case is a three-dimensional array of dates of observation, mean radial velocities, and radial velocity error estimates. The statement p(x|d) can be read as the probability of x conditional on d. Bayes' theorem relates the probability of x and d given model assumptions A by

p(x, d|A) = p(x) p(d|x, A)    (8)
          = p(d) p(x|d, A)    (9)

where the product rule for probabilities is used, p(x) is the prior probability of the orbital parameters given by the physical model, and p(x, d|A) is the joint probability of x and d. The quantity p(d|x) is the likelihood of the data given a set of parameters. We will assume all probabilities are conditional on the model assumptions A and drop the explicit conditioning for convenience. Bayes' theorem provides a way to calculate the posterior probability distribution by solving for p(x|d) in Equation 9,

p(x|d) = p(x) p(d|x) / ∫ p(x) p(d|x) dx    (10)

where the denominator, the marginal likelihood of the data, is the integral of p(x) p(d|x) over all possible parameter values. This integral is a constant for a given d, so in principle it can be ignored for the purposes of parameter estimation. The priors p(x) contain the mathematical constraints on the parameters as defined by the model, in this case a single Keplerian model. The priors used are

listed below.

- Period: uniform in log P from −∞ to +∞.
- Semi-amplitude: uniform in log K from −∞ to +∞.
- Eccentricity: uniform from 0 to 1.
- Argument of periastron: uniform from 0 to 2π.
- Time of periastron passage: uniform within the first full orbital period.
- Amplitude off-set: uniform in γ from −∞ to +∞.

Each prior is enforced at every state in the chain. The main computational task in obtaining the posterior probability distribution is to sample the posterior sufficiently such that the sample converges to the posterior. To do this we use an MCMC simulation with the Metropolis-Hastings (MH) algorithm, described below. The MCMC routine creates a chain, or series of states, each containing an orbital solution. Each state in a chain is an array of six orbital parameter values, denoted x_i for the i-th state. Given an initial set of parameter values x_0, a set of priors p(x), and a transition probability p(x_{n+1}|x_n), a Markov chain (as discussed above) is guaranteed to converge to the posterior probability distribution if the chain is aperiodic, reversible, and irreducible. In order for a chain to meet these requirements the transition probability function must be chosen carefully. The MH algorithm is used because its transition probability function meets

these criteria. It uses a transition probability function of

p(x′|x) = q(x′|x) α(x′|x)    (11)

where x is the current state of parameter values and x′ is a trial state of new parameter values. Equation 11 determines the probability of accepting a transition from the current state to the trial state, or rejecting the transition and keeping the current state. This transition probability depends on the candidate transition probability function q(x′|x), which controls the step sizes between consecutive states, and the acceptance probability α(x′|x), which controls the probability of accepting a specific step in parameter space. The candidate transition probability function is chosen to be a six-dimensional normalized Gaussian distribution with specifically chosen full-width at half-maximum values β_µ, where µ refers to one of the six orbital parameters. The value of β_µ is fixed for each chain and each parameter. The process for calculating this value before running the MCMC is discussed below. For each state in the chain a step is drawn at random from this distribution and used to create the trial state, as described below. The acceptance probability, as defined by the MH algorithm, is

α(x′|x_i) = min{ exp[ (χ²(x_i) − χ²(x′)) / 2 ], 1 }    (12)

such that α is always a number between 0 and 1. The MH algorithm updates each new state x_{i+1} from the current state of the chain x_i in five steps, as laid out by Ford (2004) and listed below. We note that we have chosen to propose a change of only one parameter at random per

iteration, and that one state consists of 50 iterations. This is to ensure that each parameter has had an opportunity to change between each state.

1. Create a trial parameter state x′ by adding to the current state x_i a draw from the candidate transition probability function q(x′|x_i). A draw from the center of this distribution would correspond to no change in the parameters.

2. Calculate χ²(x′), the goodness of fit of the trial parameters. To do this we create radial velocity values as a function of x′ and then calculate the sum of squares of the differences between each trial velocity and observed velocity. We also need the current state's goodness of fit, χ²(x_i).

3. Draw a random number u from a uniform distribution on [0, 1].

4. If u ≤ α(x′|x_i) then the trial parameters are accepted, x_{i+1} = x′. If u > α(x′|x_i) then the trial parameters are rejected and the current state is kept, x_{i+1} = x_i. According to Equation 12, if the trial parameters produce a better χ² fit to the data then they are always accepted. If the trial parameters give a slightly worse fit there is a nonzero probability that some will still be accepted. However, the worse the trial fit compared to the current fit, the less likely the trial parameters are to be accepted.

5. Update the chain with the result from step #4, go to the next state in the chain, and repeat from step #1.

How big can Δχ² be on average for a trial state to still be accepted? If the random number is u = 1/2 one can calculate the largest possible change to be Δχ² = 2 ln 2 ≈ 1.4.
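The five steps above can be sketched in a few lines of Python (illustrative only, not the thesis code). Here chi2 is assumed to evaluate Equation 6 for a parameter state, prior_ok to enforce the priors of Section 3.2, and, for simplicity, β_µ is used directly as the width of the Gaussian candidate distribution:

```python
import numpy as np

def mh_state(x_current, chi2, prior_ok, beta, n_iter=50, rng=None):
    """One MCMC 'state': n_iter Metropolis-Hastings iterations, each proposing
    a Gaussian change to a single randomly chosen orbital parameter."""
    rng = rng or np.random.default_rng()
    x = np.array(x_current, dtype=float)
    chi2_cur = chi2(x)
    for _ in range(n_iter):
        x_trial = x.copy()
        mu = rng.integers(len(x))                   # change one parameter at random
        x_trial[mu] += rng.normal(0.0, beta[mu])    # Gaussian candidate step (q)
        if not prior_ok(x_trial):
            continue                                # reject trial states outside the priors
        chi2_trial = chi2(x_trial)
        alpha = min(np.exp(0.5 * (chi2_cur - chi2_trial)), 1.0)   # Eq. 12
        if rng.uniform() <= alpha:                  # steps 3-4: accept or reject
            x, chi2_cur = x_trial, chi2_trial
    return x
```

A full chain is then built by repeatedly calling mh_state and recording each returned state.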

3.2.1 Transition Probability Function

Many integrals in Bayesian problems were largely incalculable before computational technology reached its current capabilities. Despite these improvements, maximizing the efficiency of a Markov chain is often one of the main practical challenges. As hinted at above, the transition probability function, and more specifically the β_µ parameter, is critical to the efficiency of the MH algorithm. If β_µ is too large then the proposed parameter steps will be so large as to rarely be accepted, as in step #4 above, and the chain will be slow to converge. However, if β_µ is too small then the proposed parameter steps will be so small that almost all of the proposed steps will be accepted and the chain will take an excruciatingly long time to explore all of the relevant parameter space. If β_µ is too small it is also possible for the chain to become trapped in a local minimum of χ²-space. For these reasons we have adopted the strategy of Ford (2006) of carefully choosing each β_µ parameter before running each MCMC such that each parameter sub-chain accepts proposed parameter steps approximately 45% of the time. The β_µ parameters depend intimately on the radial velocity set, and so must be calculated anew for each radial velocity set. Our approach to this problem is to run a preliminary pseudo-MCMC in which the β_µ parameters are changed iteratively until the acceptance rates are within the preferred range. We call it a pseudo-MCMC because in a real MCMC run the β_µ parameters are required to be fixed in order for the chain to have a chance to converge. The pseudo-MCMCs run on the order of 10^4 states, but may be run longer depending on the variability of the acceptance rates. Once the preferred β_µ parameters are found such that

the acceptance rates are stable around 45% for several thousand states, the pseudo-MCMC is stopped and the β_µ parameters are saved. The full MCMC is then started using these values. One difficulty we have found with this stage in the fitting process is that the ability of the pseudo-MCMC to find the preferred β_µ values depends on the initial guesses for all parameters. In general, when the acceptance rate is very different from 45% the β_µ is changed logarithmically, and when the acceptance rates are closer β_µ is changed linearly. It may help to start with a poor guess for x_0 and smaller initial β_µ values such that the initial acceptance rates are high, which can lead to a more efficient identification of the desired parameters. An example of the convergence of the acceptance rates from a pseudo-MCMC run is shown in Figure 2. The acceptance rates of each of the six orbital parameters are plotted for each step in the pseudo-MCMC, and the routine stops once each acceptance rate is approximately 45%. This plot shows the first of two calls to the pseudo-MCMC to calculate the β_µ values. These values are then passed to the MCMC routine, where they are fixed quantities.
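Schematically, the tuning loop might look like the following Python sketch (an illustration of the general idea, not the thesis code). It assumes mh_state_counting, a hypothetical variant of the mh_state sketch above that also returns per-parameter proposal and acceptance counts; the 45% target and the logarithmic-versus-linear adjustment follow the description above, but the thresholds and block sizes are arbitrary choices of ours.

```python
import numpy as np

def tune_beta(x0, chi2, prior_ok, beta0, target=0.45, n_blocks=20,
              states_per_block=500, rng=None):
    """Pseudo-MCMC sketch: adjust each beta_mu until its acceptance rate is
    close to the target, then freeze beta for the real MCMC run."""
    rng = rng or np.random.default_rng()
    beta = np.array(beta0, dtype=float)
    x = np.array(x0, dtype=float)
    for _ in range(n_blocks):
        accepted = np.zeros_like(beta)
        proposed = np.zeros_like(beta)
        for _ in range(states_per_block):
            x, acc, prop = mh_state_counting(x, chi2, prior_ok, beta, rng=rng)
            accepted += acc
            proposed += prop
        rate = accepted / np.maximum(proposed, 1.0)
        far = np.abs(rate - target) > 0.10
        # Large (logarithmic) correction when far from 45%, gentle one when close:
        # a too-small beta gives a high acceptance rate and is grown, a too-large beta shrinks.
        beta[far] *= np.clip(rate[far] / target, 0.1, 10.0)
        beta[~far] *= 1.0 + (rate[~far] - target)
    return beta
```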

Figure 2: Plotted are the pseudo-MCMC acceptance rates for a chain of parameters. The routine is fitting to the g1 data set (see Table 1). The step-size parameter β_µ for each curve is perturbed until the acceptance rate is approximately 45%. This is done before running the MCMC to increase its efficiency. The dots that make up the thin lines are the acceptance rates for a single step. The thick lines (starting at step 1000) are the acceptance rates averaged over a window of 20% of the previous values and are used to calculate the perturbations of β_µ.

3.2.2 Convergence Test

Following all preliminary pseudo-MCMC runs, the MCMC has an initial burn-in period of several thousand states, and then several thousand more before the routine begins checking the chain for convergence. The MH iteration will continue until the chain converges at least twice to the target distribution, or posterior. Once the chain has converged for the first time we continue the chain, periodically testing for a second convergence, at which point the chain is stopped. This is to ensure that the chain has in fact converged to the posterior distribution and that the success of the first convergence test was not spurious. The convergence criterion used here is the Gelman-Rubin statistic R̂ (Gilks 1996). This statistic is calculated separately for each orbital parameter, and the discussion that immediately follows is an application for a single orbital parameter. Consider m chains each of length n, where x_ij refers to the j-th element of the i-th chain. To calculate R̂ we first need to calculate the between-sequence variance B and the within-sequence variance W,

B = [n/(m − 1)] Σ_{i=1..m} (x̄_i − x̄)²    (13)

W = (1/m) Σ_{i=1..m} s_i²    (14)

where

x̄_i = (1/n) Σ_{j=1..n} x_ij    (15)

x̄ = (1/m) Σ_{i=1..m} x̄_i    (16)

s_i² = [1/(n − 1)] Σ_{j=1..n} (x_ij − x̄_i)²    (17)

The convergence statistic is defined as

R̂ = ⟨var(x)⟩ / W    (18)

where

⟨var(x)⟩ = [(n − 1)/n] W + (1/n) B    (19)

Qualitatively, R̂ compares an estimate of the pooled variance across chains to the variance within each chain. If multiple chains have converged to the same stationary posterior probability distribution then these variances will be comparable and R̂ will approach 1. We chose an accuracy of 1% for all chains, corresponding to R̂ ≤ 1.01.

4 Testing Fitting Techniques with Synthetic Radial Velocity Observations

In order to objectively evaluate the performance of these techniques and to better understand the process of orbit fitting, we fit to synthetic radial velocity sets created from chosen orbital solutions. Three classes of synthetic radial velocity data sets are created and tested.
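Before moving to the synthetic data sets, the Gelman-Rubin test defined in Section 3.2.2 (Equations 13-19) can be written in a few lines of Python (an illustrative sketch, not the thesis code; chains is assumed to be an m × n array holding one orbital parameter's values from m parallel chains):

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin statistic R-hat (Eqs. 13-19) for a single parameter.
    chains: array of shape (m, n) holding m chains of length n."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)                           # Eq. 15
    grand_mean = chain_means.mean()                             # Eq. 16
    B = n / (m - 1) * np.sum((chain_means - grand_mean) ** 2)   # Eq. 13
    W = chains.var(axis=1, ddof=1).mean()                       # Eqs. 14 and 17
    var_hat = (n - 1) / n * W + B / n                           # Eq. 19
    return var_hat / W                                          # Eq. 18
```

Convergence would then be declared once this ratio is at most 1.01 for every orbital parameter.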

4.1 Synthetic Data Sets

The first class includes 57 synthetic data sets with the chosen solution P = 80 days, K = 40 m/s, e = 0.2, ω = 170°, t_p = 12500 days, γ = 10 m/s, which corresponds to a minimum mass of 0.83 M_J and a semi-major axis of 0.36 AU. This is our "back of the book" true solution. Each data set in class 1 (see Table 1) has a unique set of observational parameters, including varying phase coverage, number of observations, time of first observation, and photospheric radial velocity jitter. Each data set spans one of four fractional phase coverages of a full period: 0.75, 1, 1.5, or 2. Each set has either 18 or 30 observations per orbital period. Each set has one of three times of first observation with respect to t_p: −20, 0, or +20 days. Each set has one of three mean photospheric jitter values σ_jitter: 0, 2, or 5 m/s. The photospheric jitter parameter attempts to model the potential contamination of the Doppler shift measurements due to flows and inhomogeneities over the surface of the star, which can be large for active stars. The amount of stellar jitter can be predicted somewhat by monitoring the Calcium II H & K lines (Wright 2005).

#     | Data Sets  | Phase Coverage | Obs. per Period | σ_jitter (m/s) | Start Date t_0 (d)
1-3   | a1, a2, a3 | 0.75           | 18              | 0              | −20, 0, +20
4-6   | b1, b2, b3 | 0.75           | 18              | 2              | −20, 0, +20
7-9   | c1, c2, c3 | 0.75           | 18              | 5              | −20, 0, +20
10-12 | d1, d2, d3 | 0.75           | 30              | 0              | −20, 0, +20
13-15 | e1, e2, e3 | 0.75           | 30              | 2              | −20, 0, +20
16-18 | f1, f2, f3 | 0.75           | 30              | 5              | −20, 0, +20
19-21 | g1, g2, g3 | 1.0            | 18              | 0              | −20, 0, +20
22-24 | h1, h2, h3 | 1.0            | 18              | 2              | −20, 0, +20
25-27 | i1, i2, i3 | 1.0            | 18              | 5              | −20, 0, +20
28-30 | j1, j2, j3 | 1.0            | 30              | 0              | −20, 0, +20
31-33 | k1, k2, k3 | 1.0            | 30              | 2              | −20, 0, +20
34-36 | l1, l2, l3 | 1.0            | 30              | 5              | −20, 0, +20
37-39 | m1, m2, m3 | 1.5            | 18              | 0              | −20, 0, +20
40-42 | n1, n2, n3 | 1.5            | 18              | 2              | −20, 0, +20
43-45 | o1, o2, o3 | 1.5            | 18              | 5              | −20, 0, +20
46-48 | p1, p2, p3 | 1.5            | 30              | 0              | −20, 0, +20
49-51 | q1, q2, q3 | 1.5            | 30              | 2              | −20, 0, +20
52-54 | r1, r2, r3 | 1.5            | 30              | 5              | −20, 0, +20
55-57 | s1, s2, s3 | 2.0            | 30              | 0              | −20, 0, +20

Table 1: Class 1 synthetic observations. A total of 57 synthetic data sets with fixed orbital parameters (P = 80 d, K = 40 m/s, e = 0.2, ω = 170°, t_p = 12500 d, γ = 10 m/s). Data sets vary in phase coverage, number of observations per orbital period, σ_jitter, and start date t_0 with respect to t_p.
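For reference, a data set of the kind listed in Table 1 could be generated along the following lines (an illustrative Python sketch, not the thesis code). It reuses the radial_velocity model sketched in Section 2, assumes the quadrature noise model of Equation 20 below, and omits the seasonal four-month observing gap described in the text; all names and defaults are ours.

```python
import numpy as np

def synthetic_data_set(params, phase_coverage, n_per_period,
                       sigma_jitter, sigma_obs=3.0, t0_offset=0.0, rng=None):
    """One synthetic radial velocity set: random observation epochs spanning
    phase_coverage orbital periods, with Gaussian noise of width
    sigma_vel = sqrt(sigma_jitter^2 + sigma_obs^2) (Eq. 20)."""
    rng = rng or np.random.default_rng()
    P, K, e, omega, tp, gamma = params
    n_obs = int(round(phase_coverage * n_per_period))
    t_start = tp + t0_offset
    t_end = t_start + phase_coverage * P
    # First and last epochs are fixed; the rest are drawn at random in between
    t = np.sort(np.concatenate([[t_start, t_end],
                                rng.uniform(t_start, t_end, n_obs - 2)]))
    sigma_vel = np.sqrt(sigma_jitter**2 + sigma_obs**2)
    v = radial_velocity(t, P, K, e, omega, tp, gamma) + rng.normal(0.0, sigma_vel, t.size)
    return t, v, np.full(t.size, sigma_obs)   # only sigma_obs is reported as the error
```

For example, a g1-like set would be synthetic_data_set((80.0, 40.0, 0.2, np.radians(170.0), 12500.0, 10.0), 1.0, 18, 0.0, t0_offset=-20.0).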

Each synthetic data set is labeled alphabetically by phase coverage, sampling, and jitter, and then numbered 1 to 3 corresponding to the three possible dates of first observation. The second class includes 21 synthetic data sets (see Table 2) with orbital eccentricity values varying from e = 0.0 to e = 0.6, with all other orbital parameters unchanged, σ_jitter = 5 m/s, and t_0 = t_p − 20 d. The third class includes 4 data sets (see Table 3) of synthetic observations of Jupiter (with P, K, and e set to Jupiter's values) with σ_jitter = 0 m/s and varying phase coverage, density of observations, and time of first observation. These 4 data sets are analogous to a1, d1, g1, and j1 of class 1, except for the change of P, K, and e. This is for the case of Jupiter viewed edge-on (i = 90°). We also include an observational velocity error σ_obs = 3 m/s for all synthetic data sets. The radial velocities and error values are defined as

v(t) = v_obs(t) + n σ_vel,   σ_vel = √(σ_jitter² + σ_obs²)    (20)

where the observational and jitter errors are added in quadrature. For each time of observation the velocity error parameter σ_vel is multiplied by a draw n from a normalized Gaussian distribution and added to the observed velocity v_obs(t). The error reported with each synthetic data set only includes the observational error σ_obs. The initial and final observation times are fixed for each radial velocity set, while the rest of the observation times are random within these limits. We also

take seasonal observational constraints into account by randomly choosing a four-month block of the year without observations. Seasonal constraints play a minor role in the 60- to 160-day data sets. In the following sections we present both MCMC and LMMC results for each class of synthetic observations (see Figures 3 through 13).

4.2 MCMC Simulations

An MCMC run consists of three simultaneous chains, each consisting of six orbital parameters, with a length on the order of 10^5 states and a burn-in period on the order of 10^4 states. Each MCMC run requires at least two preliminary pseudo-MCMC runs to calculate the β_µ parameters that give acceptance rates of approximately 45% for each parameter. An example of the convergence of the acceptance rates for the first of two pseudo-MCMC runs is shown for data set g1 in Figure 2. The number of pseudo-MCMC runs required to find the β_µ values may be more than two if the data set is particularly difficult. Every MCMC run presented below has converged twice, with the criterion that R̂ ≤ 1.01 for all orbital parameters.

4.3 LMMC Simulations

Each subset of an LMMC result agrees to 1%. The LMMC requires fewer fits in general than the MCMC because the MCMC explores a broader range of parameter space. It is difficult in general to compare the computation times because a single LM fit requires more computation time than a single MCMC state. Since the LMMC routine requires more computation time per step (by about a factor of 20) we consider an LMMC ensemble of parameters to have converged to the

chosen accuracy as soon as 11 out of 12 of the convergence criteria are met (6 means and 6 standard deviations), as described in Section 3.1. Therefore one parameter's mean or standard deviation may not yet have reached 1% accuracy, while all others have. We do not expect this to be detrimental when qualitatively comparing parameter distributions.

4.4 Class 1: Varying the Observational Parameters

Class 1 contains the bulk of the synthetic data sets tested by these two techniques. Figure 3 shows the period distributions from MCMC fits to all class 1 data sets. The distributions are represented individually by vertical contour histograms (histograms as seen from above). As shown in the legend, darker regions correspond to areas of high probability in period space. Sharply peaked period distributions correspond to a confidently estimated value of the period, the peak value of this distribution being the most probable.

Figure 3: Contour histograms of the period posterior distributions of MCMC results for all class 1 synthetic data sets. The ordinate is period and each column corresponds to one data set (a1 through s3). Normalized probability levels are shown in the legend for all histograms.

Clearly for data sets m1 through s3 the period, and all other orbital parameters, are well constrained. This is attributable to the long phase coverages, 1.5 and 2 full orbital periods, of these data sets, and is independent of the density of observations, jitter, and start date. The next step down in phase coverage, 1.0 full orbital period, corresponding to data sets g1 through l3, shows a more sporadic success rate in identifying the correct, or indeed any confident, value of the period. The most striking pattern in these distributions is the flatness of all sets with start dates of t_0 = t_p, that is, all sets ending in 2. In fact this pattern is found in all orbital parameter distributions with t_0 = t_p and a phase coverage of 1 or less, with the exception of c2 (see below). This trend can be understood in terms of the lack of coverage of either the periastron or apastron, in our case the number of peaks and troughs in the radial velocity curve (when ω = 180°, t_p occurs at the minimum of the radial velocity curve). If observations occur before, during, and after both periastron and apastron then the period will be relatively well constrained, even if the data span less than one full orbital period. For t_0 = t_p these data sets can be difficult even with a full period if there are stranded observations at the edge of the period, which are given more weight in the orbital fit. This is especially true if these edge points happen to be shifted due to large noise, as is the case with some of the i and l data sets, where σ_jitter = 5 m/s. Next to the phase coverage and the occurrence of t_0 with respect to t_p, the additional jitter component appears to be the next most dominant observational variable. This is seen by comparing the relative widths of sets g and j (σ_jitter = 0) to l and i (σ_jitter = 5), the latter appearing wider than the former.

The last group, with a phase coverage of 0.75 (far left in the plot), contains the most difficult data sets to model. The period distributions for sets a1 through f3 are very flat, with peaks rarely reaching the 9% level. Many of these distributions extend out to several orders of magnitude in period space, providing little or no information about the actual period. As expected, the hardest data set to model is c2, with the highest jitter and lowest density of observations. The MCMC results, however, give an unexpectedly confident estimation, but of the wrong period! We will explore this oddity in more detail below (see Section 4.4.1). Figure 4 shows the MCMC and LMMC peak and median period values for all class 1 data sets. The horizontal black line shows the true orbital period P = 80 d. In general the peak period values appear to be closer to the correct answer more often than the median values, especially in the case of the MCMC points. For well-constrained periods, sets m1 to s3, there is little difference. Any difference between the median and peak value of a probability distribution can be attributed to its non-Gaussian or asymmetric shape. For the difficult data sets a1 to f3 (#1 to #18), occasionally the peak or median period value falls close to the correct answer by chance. Therefore it is important to report the large uncertainty in this value, which is the standard deviation of the distribution.

Figure 4: MCMC (red) and LMMC (blue) period values, median (x) and peak (dot), for all class 1 synthetic data sets. The true value of P = 80 d is drawn as a horizontal black line.

The LMMC period distributions are tighter and more symmetric than the MCMC distributions, as apparent in the difference between the peak and median values, because each LM residual fit starts with a good guess for the period (from a periodogram), and additional residual fits stray little from this region in period space. As mentioned above as a potential problem with the LMMC, if there is a deep local minimum in χ² space close to the initial guess then the LM routine may not explore other parts of parameter space where there may be a better fit. On the contrary, MCMC places less emphasis on the χ² statistic and is designed to explore all of parameter space independent of the initial values. This leads to wider distributions in general. Figures 5, 6, 7, 8, and 9 show the MCMC and LMMC semi-amplitude K, eccentricity, ω, t_p, and γ peak and median values, respectively. Each plot displays the same trend of tighter values when there is better phase coverage (left to right in the plots). Most outlying points in these charts correspond to data sets with start dates of t_0 = t_p, which can be attributed to the difficulty of constraining the period as discussed above. The scatter of parameter values about the correct answer in each orbital parameter plot cannot be directly compared to other parameters because the scales are different. However, the t_p plot (Figure 8) seems to show a larger scatter for LMMC values for data sets with t_0 = t_p compared to the MCMC values. This is especially the case for data sets h2 (#23) and i2 (#26), where t_p was inaccurately estimated. The MCMC does a better job with these cases, possibly because the MCMC relies on the actual data set, whereas the LMMC adds residual noise to observations near the edge of the period. For t_0 = t_p these

observations near the edge are critical, and by adding residual noise the LMMC makes it much more difficult for the fitting routine to consistently identify the same trend near the edge of the curve. We plot χ² evaluated at the peak MCMC parameter values and at the best-fit LMMC parameter values in Figure 10 for comparison. The peak MCMC parameter values do not necessarily correspond to the minimum χ² value, so the MCMC values (red) are higher in general. The χ² values are plotted on the y-axis logarithmically, and are calculated using both the synthetic (dot) and true (asterisk) radial velocity values in Equation 6. The true radial velocity values are the synthetic values before the noise term n σ_vel in Equation 20 is added. Again there is a trend of lower χ² for data sets with better phase coverage (left to right in the plot), for both the LMMC best-fit and MCMC peak parameters. If the χ²_true (asterisk) value is lower than the χ²_synth (dot) value then the fitting technique was able to overcome the noise added to the signal. This is the case for almost all data sets with at least one full phase coverage (#19 and up).

Figure 5: MCMC (red) and LMMC (blue) semi-amplitude K (m/s) values, shown as median (x) and peak (dot), for all class 1 synthetic data sets. The true value of K = 40 m/s is drawn as a horizontal black line.

Figure 6: MCMC (red) and LMMC (blue) eccentricity values, median (x) and peak (dot), for all class 1 synthetic data sets. The true value of e = 0.2 is drawn as a horizontal black line.

Figure 7: MCMC (red) and LMMC (blue) argument of periastron ω (degrees) values, median (x) and peak (dot), for all class 1 synthetic data sets. The true value of ω = 170° is drawn as a horizontal black line.

Figure 8: MCMC (red) and LMMC (blue) time of periastron passage t_p (d) values, median (x) and peak (dot), for all class 1 synthetic data sets. The true value of t_p = 12500 d is drawn as a horizontal black line.

Figure 9: MCMC (red) and LMMC (blue) amplitude off-set γ (m/s) values, median (x) and peak (dot), for all class 1 synthetic data sets. The true value of γ = 10 m/s is drawn as a horizontal black line.

Figure 10: χ² evaluated at the peak MCMC and best-fit LMMC orbital parameter values. The χ² values are plotted on the y-axis logarithmically, and are calculated using both the synthetic (dot) and true (asterisk) radial velocity values. Observational error measurements are included in all χ² values, so many of the χ²_true values are less than 1.

4.4.1 Worst Cases: Data Sets c1, c2, and c3

Figure 11 shows the radial velocity plot and each parameter histogram for c1, c2, and c3; vertical black lines correspond to the true parameter values. One can easily understand by glancing at the radial velocity observations why these three data sets are difficult to model: low phase coverage (0.75), low density of observations (18 per period), and high jitter (5 m/s). As mentioned above, the MCMC period distributions (red) for c2 confidently estimate the wrong orbital period, and as is apparent in Figure 11, the LMMC fails similarly. Data set c1 also shows this surprising result. For c2 the LMMC and MCMC disagree dramatically in their estimation of the probability of the 60-day period. This may be attributable to the single radial velocity observation at the end of the c2 data set, which has likely been shifted down due to the large jitter. In the MCMC method this data point is never changed, thereby providing a constraint on the upper limit of the period distribution. The sharp probability peak at the lower period limit may be an effect of both upper and lower period constraints. The LMMC routine perturbs this data point when adding on the residuals of the best fit, decreasing the emphasis that the LM routine places on the exact observed amplitude value and allowing for a larger range of periods. For the case of this data set the LMMC better represents the true period, or at least shows less confidence in the wrong period. The MCMC might improve for this data set if the true jitter of the star were accurately represented in the observational error bars, which would de-emphasize the amplitude of the last observation and allow for longer period fits. For data set c3 the MCMC again peaks at 60 days but the LMMC peaks at

30 days. The LMMC peak is different from the actual best-fit period to the data, which is 62 days, and it is surprising that the LMMC peak period is so far from the best-fit period. This may be explained by the fact that the c2 data set has a large gap at apastron (the peak in the radial velocity curve), which could allow a 60-day signal to appear like a 30-day signal. The period distributions are all shifted towards lower periods, with peaks around 60 or 30 days. Notice that to the left of these peak periods there is a sharp drop, indicating that periods lower than these peaks are significantly worse, but that longer periods are still possible. This shows up as a long tail in the distributions, which is cut off in these plots at 140 days. The MCMC results for the period of c2 are so sharply peaked at 60 days, and the LMMC for c3 at 30 days, that one might be tempted to declare that these must be the true period values. This would be a case of relying too heavily on the parameter probability distributions to determine whether the parameters are in fact a good fit to the data, and ignoring the actual goodness-of-fit statistic, χ². Figure 10 shows that for the c2 peak parameter values χ² ≈ 3, indicating that it is in fact not such a good fit. We discuss the relative importance of the parameter distributions and the χ² value in the Discussion section below.


Astronomy 421. Lecture 8: Binary stars

Astronomy 421. Lecture 8: Binary stars Astronomy 421 Lecture 8: Binary stars 1 Key concepts: Binary types How to use binaries to determine stellar parameters The mass-luminosity relation 2 Binary stars So far, we ve looked at the basic physics

More information

Parameter estimation and forecasting. Cristiano Porciani AIfA, Uni-Bonn

Parameter estimation and forecasting. Cristiano Porciani AIfA, Uni-Bonn Parameter estimation and forecasting Cristiano Porciani AIfA, Uni-Bonn Questions? C. Porciani Estimation & forecasting 2 Temperature fluctuations Variance at multipole l (angle ~180o/l) C. Porciani Estimation

More information

Bayesian search for other Earths

Bayesian search for other Earths Bayesian search for other Earths Low-mass planets orbiting nearby M dwarfs Mikko Tuomi University of Hertfordshire, Centre for Astrophysics Research Email: mikko.tuomi@utu.fi Presentation, 19.4.2013 1

More information

Indirect Methods: gravitational perturbation of the stellar motion. Exoplanets Doppler method

Indirect Methods: gravitational perturbation of the stellar motion. Exoplanets Doppler method Indirect Methods: gravitational perturbation of the stellar motion Exoplanets The reflex motion of the star is proportional to M p /M * This introduces an observational bias that favours the detection

More information

Stat 516, Homework 1

Stat 516, Homework 1 Stat 516, Homework 1 Due date: October 7 1. Consider an urn with n distinct balls numbered 1,..., n. We sample balls from the urn with replacement. Let N be the number of draws until we encounter a ball

More information

Advanced Statistical Methods. Lecture 6

Advanced Statistical Methods. Lecture 6 Advanced Statistical Methods Lecture 6 Convergence distribution of M.-H. MCMC We denote the PDF estimated by the MCMC as. It has the property Convergence distribution After some time, the distribution

More information

Cover Page. The handle holds various files of this Leiden University dissertation

Cover Page. The handle   holds various files of this Leiden University dissertation Cover Page The handle http://hdl.handle.net/1887/49240 holds various files of this Leiden University dissertation Author: Schwarz, Henriette Title: Spinning worlds Issue Date: 2017-06-01 89 4 Spin measurement

More information

MASS DETERMINATIONS OF POPULATION II BINARY STARS

MASS DETERMINATIONS OF POPULATION II BINARY STARS MASS DETERMINATIONS OF POPULATION II BINARY STARS Kathryn E. Williamson Department of Physics and Astronomy, The University of Georgia, Athens, GA 30602-2451 James N. Heasley Institute for Astronomy, University

More information

The Jackknife-Like Method for Assessing Uncertainty of Point Estimates for Bayesian Estimation in a Finite Gaussian Mixture Model

The Jackknife-Like Method for Assessing Uncertainty of Point Estimates for Bayesian Estimation in a Finite Gaussian Mixture Model Thai Journal of Mathematics : 45 58 Special Issue: Annual Meeting in Mathematics 207 http://thaijmath.in.cmu.ac.th ISSN 686-0209 The Jackknife-Like Method for Assessing Uncertainty of Point Estimates for

More information

Detecting Extra-solar Planets with a Bayesian hybrid MCMC Kepler periodogram

Detecting Extra-solar Planets with a Bayesian hybrid MCMC Kepler periodogram Detecting Extra-solar Planets with a Bayesian hybrid MCMC Kepler periodogram P. C. Gregory Physics and Astronomy, University of British Columbia e-mail: gregory@phas.ubc.ca http://www.physics.ubc.ca/ gregory/gregory.html

More information

Problem Solving Strategies: Sampling and Heuristics. Kevin H. Knuth Department of Physics University at Albany Albany NY, USA

Problem Solving Strategies: Sampling and Heuristics. Kevin H. Knuth Department of Physics University at Albany Albany NY, USA Problem Solving Strategies: Sampling and Heuristics Department of Physics University at Albany Albany NY, USA Outline Methodological Differences Inverses vs. Inferences Problem Transformation From Inference

More information

Importance of the study of extrasolar planets. Exoplanets Introduction. Importance of the study of extrasolar planets

Importance of the study of extrasolar planets. Exoplanets Introduction. Importance of the study of extrasolar planets Importance of the study of extrasolar planets Exoplanets Introduction Planets and Astrobiology (2017-2018) G. Vladilo Technological and scientific spin-offs Exoplanet observations are driving huge technological

More information

Discovery of Planetary Systems With SIM

Discovery of Planetary Systems With SIM Discovery of Planetary Systems With SIM Principal Investigator: Geoffrey W. Marcy (UC Berkeley) Team Members: Paul R. Butler (Carnegie Inst. of Washington), Sabine Frink (UC San Diego), Debra Fischer (UC

More information

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain

More information

Statistical Data Analysis Stat 3: p-values, parameter estimation

Statistical Data Analysis Stat 3: p-values, parameter estimation Statistical Data Analysis Stat 3: p-values, parameter estimation London Postgraduate Lectures on Particle Physics; University of London MSci course PH4515 Glen Cowan Physics Department Royal Holloway,

More information

Computational statistics

Computational statistics Computational statistics Markov Chain Monte Carlo methods Thierry Denœux March 2017 Thierry Denœux Computational statistics March 2017 1 / 71 Contents of this chapter When a target density f can be evaluated

More information

Bagging During Markov Chain Monte Carlo for Smoother Predictions

Bagging During Markov Chain Monte Carlo for Smoother Predictions Bagging During Markov Chain Monte Carlo for Smoother Predictions Herbert K. H. Lee University of California, Santa Cruz Abstract: Making good predictions from noisy data is a challenging problem. Methods

More information

Statistical Methods in Particle Physics

Statistical Methods in Particle Physics Statistical Methods in Particle Physics Lecture 11 January 7, 2013 Silvia Masciocchi, GSI Darmstadt s.masciocchi@gsi.de Winter Semester 2012 / 13 Outline How to communicate the statistical uncertainty

More information

The Revolution of the Moons of Jupiter

The Revolution of the Moons of Jupiter The Revolution of the Moons of Jupiter Overview: During this lab session you will make use of a CLEA (Contemporary Laboratory Experiences in Astronomy) computer program generously developed and supplied

More information

Comparing Non-informative Priors for Estimation and Prediction in Spatial Models

Comparing Non-informative Priors for Estimation and Prediction in Spatial Models Environmentrics 00, 1 12 DOI: 10.1002/env.XXXX Comparing Non-informative Priors for Estimation and Prediction in Spatial Models Regina Wu a and Cari G. Kaufman a Summary: Fitting a Bayesian model to spatial

More information

Dynamical Stability of Terrestrial and Giant Planets in the HD Planetary System

Dynamical Stability of Terrestrial and Giant Planets in the HD Planetary System Dynamical Stability of Terrestrial and Giant Planets in the HD 155358 Planetary System James Haynes Advisor: Nader Haghighipour ABSTRACT The results of a study of the dynamical evolution and the habitability

More information

Hierarchical Bayesian Modeling

Hierarchical Bayesian Modeling Hierarchical Bayesian Modeling Making scientific inferences about a population based on many individuals Angie Wolfgang NSF Postdoctoral Fellow, Penn State Astronomical Populations Once we discover an

More information

Physics 509: Error Propagation, and the Meaning of Error Bars. Scott Oser Lecture #10

Physics 509: Error Propagation, and the Meaning of Error Bars. Scott Oser Lecture #10 Physics 509: Error Propagation, and the Meaning of Error Bars Scott Oser Lecture #10 1 What is an error bar? Someone hands you a plot like this. What do the error bars indicate? Answer: you can never be

More information

Satellite Communications

Satellite Communications Satellite Communications Lecture (3) Chapter 2.1 1 Gravitational Force Newton s 2nd Law: r r F = m a Newton s Law Of Universal Gravitation (assuming point masses or spheres): Putting these together: r

More information

Statistical Methods in Particle Physics Lecture 1: Bayesian methods

Statistical Methods in Particle Physics Lecture 1: Bayesian methods Statistical Methods in Particle Physics Lecture 1: Bayesian methods SUSSP65 St Andrews 16 29 August 2009 Glen Cowan Physics Department Royal Holloway, University of London g.cowan@rhul.ac.uk www.pp.rhul.ac.uk/~cowan

More information

Frequentist-Bayesian Model Comparisons: A Simple Example

Frequentist-Bayesian Model Comparisons: A Simple Example Frequentist-Bayesian Model Comparisons: A Simple Example Consider data that consist of a signal y with additive noise: Data vector (N elements): D = y + n The additive noise n has zero mean and diagonal

More information

A Bayesian Approach to Phylogenetics

A Bayesian Approach to Phylogenetics A Bayesian Approach to Phylogenetics Niklas Wahlberg Based largely on slides by Paul Lewis (www.eeb.uconn.edu) An Introduction to Bayesian Phylogenetics Bayesian inference in general Markov chain Monte

More information

16 : Approximate Inference: Markov Chain Monte Carlo

16 : Approximate Inference: Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models 10-708, Spring 2017 16 : Approximate Inference: Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Yuan Yang, Chao-Ming Yen 1 Introduction As the target distribution

More information

Lecture 6: Markov Chain Monte Carlo

Lecture 6: Markov Chain Monte Carlo Lecture 6: Markov Chain Monte Carlo D. Jason Koskinen koskinen@nbi.ku.dk Photo by Howard Jackman University of Copenhagen Advanced Methods in Applied Statistics Feb - Apr 2016 Niels Bohr Institute 2 Outline

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters

More information

Paradoxical Results in Multidimensional Item Response Theory

Paradoxical Results in Multidimensional Item Response Theory UNC, December 6, 2010 Paradoxical Results in Multidimensional Item Response Theory Giles Hooker and Matthew Finkelman UNC, December 6, 2010 1 / 49 Item Response Theory Educational Testing Traditional model

More information

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling Christopher Jennison Department of Mathematical Sciences, University of Bath, UK http://people.bath.ac.uk/mascj Adriana Ibrahim Institute

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 Markov Chain Monte Carlo Methods Barnabás Póczos & Aarti Singh Contents Markov Chain Monte Carlo Methods Goal & Motivation Sampling Rejection Importance Markov

More information

THE PLANE OF THE KUIPER BELT

THE PLANE OF THE KUIPER BELT The Astronomical Journal, 127:2418 2423, 2004 April # 2004. The American Astronomical Society. All rights reserved. Printed in U.S.A. THE PLANE OF THE KUIPER BELT Michael E. Brown Division of Geological

More information

Observational Cosmology Journal Club

Observational Cosmology Journal Club Observational Cosmology Journal Club 07/09/2018 Shijie Wang 1. Heller, R. (2018). Formation of hot Jupiters through disk migration and evolving stellar tides. Retrieved from arxiv.1806.06601 2. Rey, J.,

More information

Hierarchical Bayesian Modeling of Planet Populations

Hierarchical Bayesian Modeling of Planet Populations Hierarchical Bayesian Modeling of Planet Populations You ve found planets in your data (or not)...... now what?! Angie Wolfgang NSF Postdoctoral Fellow, Penn State Why Astrostats? This week: a sample of

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

Physics 509: Bootstrap and Robust Parameter Estimation

Physics 509: Bootstrap and Robust Parameter Estimation Physics 509: Bootstrap and Robust Parameter Estimation Scott Oser Lecture #20 Physics 509 1 Nonparametric parameter estimation Question: what error estimate should you assign to the slope and intercept

More information

Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model

Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model UNIVERSITY OF TEXAS AT SAN ANTONIO Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model Liang Jing April 2010 1 1 ABSTRACT In this paper, common MCMC algorithms are introduced

More information

Chapter 6. Estimation of Confidence Intervals for Nodal Maximum Power Consumption per Customer

Chapter 6. Estimation of Confidence Intervals for Nodal Maximum Power Consumption per Customer Chapter 6 Estimation of Confidence Intervals for Nodal Maximum Power Consumption per Customer The aim of this chapter is to calculate confidence intervals for the maximum power consumption per customer

More information

Chapter 2 Introduction to Binary Systems

Chapter 2 Introduction to Binary Systems Chapter 2 Introduction to Binary Systems In order to model stars, we must first have a knowledge of their physical properties. In this chapter, we describe how we know the stellar properties that stellar

More information

MH I. Metropolis-Hastings (MH) algorithm is the most popular method of getting dependent samples from a probability distribution

MH I. Metropolis-Hastings (MH) algorithm is the most popular method of getting dependent samples from a probability distribution MH I Metropolis-Hastings (MH) algorithm is the most popular method of getting dependent samples from a probability distribution a lot of Bayesian mehods rely on the use of MH algorithm and it s famous

More information

The asteroids. Example for the usage of the Virtual Observatory

The asteroids. Example for the usage of the Virtual Observatory Example for the usage of the Virtual Observatory The asteroids Florian Freistetter, ZAH, Heidelberg florian@ari.uni-heidelberg.de Asteroids in the solar system There are not only planets in our solar system.

More information

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University Phylogenetics: Bayesian Phylogenetic Analysis COMP 571 - Spring 2015 Luay Nakhleh, Rice University Bayes Rule P(X = x Y = y) = P(X = x, Y = y) P(Y = y) = P(X = x)p(y = y X = x) P x P(X = x 0 )P(Y = y X

More information

Lecture 20: Planet formation II. Clues from Exoplanets

Lecture 20: Planet formation II. Clues from Exoplanets Lecture 20: Planet formation II. Clues from Exoplanets 1 Outline Definition of a planet Properties of exoplanets Formation models for exoplanets gravitational instability model core accretion scenario

More information

GRAVITATION. F = GmM R 2

GRAVITATION. F = GmM R 2 GRAVITATION Name: Partner: Section: Date: PURPOSE: To explore the gravitational force and Kepler s Laws of Planetary motion. INTRODUCTION: Newton s law of Universal Gravitation tells us that the gravitational

More information

Discussion of Maximization by Parts in Likelihood Inference

Discussion of Maximization by Parts in Likelihood Inference Discussion of Maximization by Parts in Likelihood Inference David Ruppert School of Operations Research & Industrial Engineering, 225 Rhodes Hall, Cornell University, Ithaca, NY 4853 email: dr24@cornell.edu

More information

2 Ford, Rasio, & Yu. 2. Two Planets, Unequal Masses

2 Ford, Rasio, & Yu. 2. Two Planets, Unequal Masses 2 Ford, Rasio, & Yu unlikely to have developed such a large eccentricity, since dissipation in the disk tends to circularize orbits. Dynamical instabilities leading to the ejection of one planet while

More information

Delicious Diameters of Dwarfs (in particular, the juicy red ones)

Delicious Diameters of Dwarfs (in particular, the juicy red ones) Delicious Diameters of Dwarfs (in particular, the juicy red ones) Tabetha Boyajian GSU / Hubble Fellow In collaboration with a whole bunch of y all Radical Radii of Red Stars Tabetha Boyajian GSU / Hubble

More information

Feynman Says: Newton implies Kepler, No Calculus Needed!

Feynman Says: Newton implies Kepler, No Calculus Needed! The Journal of Symbolic Geometry Volume 1 (2006) Feynman Says: Newton implies Kepler, No Calculus Needed! Brian Beckman http://weblogs.asp.net/brianbec Abstract: We recapitulate Feynman's demonstration

More information

The Problem. Until 1995, we only knew of one Solar System - our own

The Problem. Until 1995, we only knew of one Solar System - our own Extrasolar Planets Until 1995, we only knew of one Solar System - our own The Problem We had suspected for hundreds of years, and had confirmed as long ago as the 1800s that the stars were extremely distant

More information

Likelihood and Fairness in Multidimensional Item Response Theory

Likelihood and Fairness in Multidimensional Item Response Theory Likelihood and Fairness in Multidimensional Item Response Theory or What I Thought About On My Holidays Giles Hooker and Matthew Finkelman Cornell University, February 27, 2008 Item Response Theory Educational

More information

1 Probabilities. 1.1 Basics 1 PROBABILITIES

1 Probabilities. 1.1 Basics 1 PROBABILITIES 1 PROBABILITIES 1 Probabilities Probability is a tricky word usually meaning the likelyhood of something occuring or how frequent something is. Obviously, if something happens frequently, then its probability

More information

Science Olympiad Astronomy C Division Event National Exam

Science Olympiad Astronomy C Division Event National Exam Science Olympiad Astronomy C Division Event National Exam University of Nebraska-Lincoln May 15-16, 2015 Team Number: Team Name: Instructions: 1) Please turn in all materials at the end of the event. 2)

More information

Bayesian Networks in Educational Assessment

Bayesian Networks in Educational Assessment Bayesian Networks in Educational Assessment Estimating Parameters with MCMC Bayesian Inference: Expanding Our Context Roy Levy Arizona State University Roy.Levy@asu.edu 2017 Roy Levy MCMC 1 MCMC 2 Posterior

More information

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 EPSY 905: Intro to Bayesian and MCMC Today s Class An

More information

1 Probabilities. 1.1 Basics 1 PROBABILITIES

1 Probabilities. 1.1 Basics 1 PROBABILITIES 1 PROBABILITIES 1 Probabilities Probability is a tricky word usually meaning the likelyhood of something occuring or how frequent something is. Obviously, if something happens frequently, then its probability

More information

DATA LAB. Data Lab Page 1

DATA LAB. Data Lab Page 1 NOTE: This DataLab Activity Guide will be updated soon to reflect April 2015 changes DATA LAB PURPOSE. In this lab, students analyze and interpret quantitative features of their brightness graph to determine

More information

Bayesian Analysis of RR Lyrae Distances and Kinematics

Bayesian Analysis of RR Lyrae Distances and Kinematics Bayesian Analysis of RR Lyrae Distances and Kinematics William H. Jefferys, Thomas R. Jefferys and Thomas G. Barnes University of Texas at Austin, USA Thanks to: Jim Berger, Peter Müller, Charles Friedman

More information

Lab 5: Searching for Extra-Solar Planets

Lab 5: Searching for Extra-Solar Planets Lab 5: Searching for Extra-Solar Planets Until 1996, astronomers only knew about planets orbiting our sun. Though other planetary systems were suspected to exist, none had been found. Now, thirteen years

More information

Brandon C. Kelly (Harvard Smithsonian Center for Astrophysics)

Brandon C. Kelly (Harvard Smithsonian Center for Astrophysics) Brandon C. Kelly (Harvard Smithsonian Center for Astrophysics) Probability quantifies randomness and uncertainty How do I estimate the normalization and logarithmic slope of a X ray continuum, assuming

More information

Detection ASTR ASTR509 Jasper Wall Fall term. William Sealey Gosset

Detection ASTR ASTR509 Jasper Wall Fall term. William Sealey Gosset ASTR509-14 Detection William Sealey Gosset 1876-1937 Best known for his Student s t-test, devised for handling small samples for quality control in brewing. To many in the statistical world "Student" was

More information

Lab 6: The Planets and Kepler

Lab 6: The Planets and Kepler Lab 6: The Planets and Kepler The Motion of the Planets part I 1. Morning and Evening Stars. Start up Stellarium, and check to see if you have the Angle Tool installed it looks like a sideways A ( ) in

More information

A Statistical Input Pruning Method for Artificial Neural Networks Used in Environmental Modelling

A Statistical Input Pruning Method for Artificial Neural Networks Used in Environmental Modelling A Statistical Input Pruning Method for Artificial Neural Networks Used in Environmental Modelling G. B. Kingston, H. R. Maier and M. F. Lambert Centre for Applied Modelling in Water Engineering, School

More information

10/16/ Detecting Planets Around Other Stars. Chapter 10: Other Planetary Systems The New Science of Distant Worlds

10/16/ Detecting Planets Around Other Stars. Chapter 10: Other Planetary Systems The New Science of Distant Worlds 10/16/17 Lecture Outline 10.1 Detecting Planets Around Other Stars Chapter 10: Other Planetary Systems The New Science of Distant Worlds Our goals for learning: How do we detect planets around other stars?

More information

Lecture 5. G. Cowan Lectures on Statistical Data Analysis Lecture 5 page 1

Lecture 5. G. Cowan Lectures on Statistical Data Analysis Lecture 5 page 1 Lecture 5 1 Probability (90 min.) Definition, Bayes theorem, probability densities and their properties, catalogue of pdfs, Monte Carlo 2 Statistical tests (90 min.) general concepts, test statistics,

More information

Bayesian Experimental Designs

Bayesian Experimental Designs Bayesian Experimental Designs Quan Long, Marco Scavino, Raúl Tempone, Suojin Wang SRI Center for Uncertainty Quantification Computer, Electrical and Mathematical Sciences & Engineering Division, King Abdullah

More information

SAMSI Astrostatistics Tutorial. Models with Gaussian Uncertainties (lecture 2)

SAMSI Astrostatistics Tutorial. Models with Gaussian Uncertainties (lecture 2) SAMSI Astrostatistics Tutorial Models with Gaussian Uncertainties (lecture 2) Phil Gregory University of British Columbia 2006 The rewards of data analysis: 'The universe is full of magical things, patiently

More information

A BAYESIAN ANALYSIS OF EXTRASOLAR PLANET DATA FOR HD 73526

A BAYESIAN ANALYSIS OF EXTRASOLAR PLANET DATA FOR HD 73526 The Astrophysical Journal, 631:1198 1214, 2005 October 1 # 2005. The American Astronomical Society. All rights reserved. Printed in U.S.A. A A BAYESIAN ANALYSIS OF EXTRASOLAR PLANET DATA FOR HD 73526 P.

More information