Development of an algorithm for the problem of the least-squares method: Preliminary Numerical Experience


Sergey Yu. Kamensky 1, Vladimir F. Boykov 2, Zakhary N. Khutorovsky 3, Terry K. Alfriend 4

1 Vympel International Corporation, 101000, Moscow, Russia, Chief Designer
2 Vympel International Corporation, 101000, Moscow, Russia, Lead Scientist
3 Vympel International Corporation, 101000, Moscow, Russia, Section Manager
4 TEES Distinguished Research Chair Professor, Texas A&M University, USA

Abstract

We consider methods of minimizing the quadratic forms arising in the least-squares problem (LSP). We demonstrate that the success of the commonly used variants of the Gauss-Newton methodology rests on considerations that are not related to the structure of the matrix of partial derivatives. At the same time, a singular value decomposition (SVD) of this matrix makes it possible to estimate, at each step, the dimensionality of the subspace in which the minimization can be successful. As an illustration we use orbit determination for a half-day, highly elliptical satellite of the Molniya type. The orbit is built from optical measurements. The initial guess is an orbit whose plane is turned 140 degrees from the actual position (i.e., almost the maximum possible deviation). This problem is considered a preliminary stage in the numerical study of the convergence area for orbits of the Molniya type. A comparison of the Gauss-Newton method with variable steps against the SVD technique demonstrates the advantages of the latter. These results permit the creation of an algorithm whose convergence is all but guaranteed and does not depend on the initial guess. The additional effort required by the SVD does not represent a serious obstacle, given available computational speeds of hundreds of teraflops.

Introduction

This section gives a brief review of the two main types of minimization methods for the least-squares problem.

A suite of the classical methods for minimization of the LSP

Presented in Fig. 1 are several techniques for minimizing a function by the least-squares method. A brief description of these techniques is given below.

1. Let us consider the main techniques for solving the Gauss-Newton equations [1]. We will use P to denote the symmetric matrix on the left-hand side of these equations. It can be shown by a simple calculation that this symmetric matrix can be represented as P = L L^T, where L is a lower triangular matrix. This expansion is called the Cholesky factorization. The system can then be solved easily by solving two triangular systems consecutively.
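As a concrete illustration of the Cholesky-based step just described, here is a minimal sketch in Python/NumPy. It assumes the standard Gauss-Newton setting with P = A^T A, where A is the matrix of partial derivatives and r is the vector of measurement residuals; the function and variable names are ours, not the paper's.

```python
import numpy as np

# Minimal sketch: one Gauss-Newton step from the normal equations P dx = A^T r,
# with the symmetric matrix P = A^T A factorized as P = L L^T (Cholesky).
def gauss_newton_step_cholesky(A, r):
    P = A.T @ A                      # symmetric normal matrix
    rhs = A.T @ r                    # right-hand side of the normal equations
    L = np.linalg.cholesky(P)        # P = L L^T, L lower triangular
    y = np.linalg.solve(L, rhs)      # forward substitution:  L y = rhs
    return np.linalg.solve(L.T, y)   # back substitution:     L^T dx = y
```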

Fig. 1. A suite of the classical methods for minimization of the LSP

Unfortunately, if the matrix P is poorly conditioned, rounding errors can lead to small negative numbers appearing on the diagonal. This is why a modified Cholesky technique was proposed. In this method it is not the original matrix P that is factorized but a corrected one, P + E = L D L^T, built in such a way that all the elements of the diagonal matrix D are significantly positive and the absolute values of all the elements of the triangular matrix L are uniformly bounded from above. To satisfy these conditions, small additions are made to the matrix during the factorization process, if needed. As a result, a corrected matrix is obtained instead of the original one, differing from it only by a small diagonal matrix E.
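The following sketch illustrates only the idea of the diagonal correction E; it is not the full modified Cholesky algorithm, and the retry loop and threshold are our simplifications. The correction is grown until an ordinary Cholesky factorization of P + E succeeds.

```python
import numpy as np

# Illustrative only: find a diagonal correction E = tau*I such that P + E
# admits a Cholesky factorization, then return the factor and the correction.
def corrected_cholesky(P, delta=1e-8):
    n = P.shape[0]
    tau = 0.0
    while True:
        try:
            L = np.linalg.cholesky(P + tau * np.eye(n))
            return L, tau * np.eye(n)        # corrected factor and the correction E
        except np.linalg.LinAlgError:
            tau = max(2.0 * tau, delta)      # grow the correction and retry
```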

However, if the matrix is very poorly conditioned, the nonlinearity of the function being minimized starts to play a noticeable role. Indeed, if the matrix P is diagonalized by an orthogonal transformation, P = V Λ V^T with eigenvalues λ_i and eigenvectors v_i, then the solution is given as

x = Σ_i (v_i^T b / λ_i) v_i,

where b is the right-hand side of the normal equations. The components v_i^T b / λ_i corresponding to the small eigenvalues λ_i become very large, and the behavior of the function at such large distances from the initial point no longer corresponds to the linear approximation employed. Several methods have been suggested to avoid getting into the region where the approximation of the minimized function is poor.

2. Minimization with respect to only some of the variables. It is assumed that the user knows that the function being minimized contains variables which affect its value very significantly (or which are known only very roughly) and less important variables (or those known more precisely). The minimization then proceeds in two stages: first, minimization with respect to the roughly known variables; then, in a second stage which may be absent in simple cases, the precisely known variables are dealt with. This idea underlies a great number of algorithms for specific cases of orbit refinement using only some of the variables. Since the efficiency of these algorithms depends on knowing in advance which variables belong to which group, they are usually applied to orbit refinement over a short time interval, when one can exploit knowledge of the accuracy of the initial measurements.

3. Normalization of the function being minimized. It is assumed that the user knows that the function has large derivatives with respect to some variables and small ones with respect to the rest. An auxiliary function is then constructed in which the weight of the variables with larger derivatives is reduced and the others are weighted more heavily. The justification is that the fast variables can be ignored at the beginning, since they can always be treated easily later, for example with a steepest-descent minimization.

4. The ravine ("ditch") method proposed by I. M. Gelfand, a corresponding member of the Russian Academy of Sciences. It is assumed that there are directions of fast decrease of the function, which can be computed with high accuracy, and also ravine directions, along which the function decreases very slowly and for which the first derivatives do not provide sufficient accuracy. The method works as follows. Initial points are chosen and a fast descent is carried out to the bottom of the ravine. The bottom points are then used to approximate the partial derivatives with respect to the ravine variables, and the direction of descent along the bottom is chosen; the minimum is sought in this direction, and the process is repeated. This method is ideologically similar to the two above, but it does not require any prior knowledge for dividing the variables into two groups.

5. The dogleg technique, widely used in English-speaking countries (the name refers to the golf term). Here the steepest descent is used for the fast variables, while the slow ones are treated by making a step with the Newton technique.
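For concreteness, here is a minimal sketch of a standard (Powell-style) dogleg step for the linearized problem min ||A x - r||^2 within a radius delta. This is the textbook form of the dogleg, not necessarily the exact variant the authors have in mind, and the names are ours.

```python
import numpy as np

# Textbook dogleg: combine the steepest-descent (Cauchy) point with the full
# Gauss-Newton point, staying inside a ball of radius delta.
def dogleg_step(A, r, delta):
    p_gn = np.linalg.lstsq(A, r, rcond=None)[0]        # full Gauss-Newton step
    if np.linalg.norm(p_gn) <= delta:
        return p_gn
    d = A.T @ r                                        # steepest-descent direction
    p_sd = (d @ d) / np.linalg.norm(A @ d) ** 2 * d    # Cauchy point along d
    if np.linalg.norm(p_sd) >= delta:
        return delta * p_sd / np.linalg.norm(p_sd)     # truncated steepest descent
    # otherwise walk from the Cauchy point toward the Gauss-Newton point until
    # the boundary ||p|| = delta is reached
    step = p_gn - p_sd
    a, b, c = step @ step, 2.0 * (p_sd @ step), p_sd @ p_sd - delta ** 2
    t = (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return p_sd + t * step
```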

6. Another widely used method is the trust-region one [2]. The initial assumption is that the user can define the size of the region in which the approximation of the function is sufficiently accurate. The least-squares function is then minimized subject to the condition that the step remain within the trust region, that is, the following problem is solved:

minimize ||A x - r||^2   subject to   ||x|| ≤ Δ,

using a Lagrange multiplier μ, which leads to the system (P + μ I) x = b, where b is the right-hand side of the normal equations. It can be seen that this technique is somewhat similar to the modified Cholesky method with a corrected matrix, but with a different initial motivation. The constraint ||x(μ)|| = Δ is used to find the Lagrange multiplier; it can also be written as

Σ_i (v_i^T b)^2 / (λ_i + μ)^2 = Δ^2,

where v_i and λ_i are the eigenvectors and eigenvalues of the matrix P. It can be seen from this formula that, as the Lagrange multiplier is increased, the components along the eigenvectors corresponding to the small eigenvalues are suppressed. It can be said that in the trust-region technique we also have a separation of the eigenvectors into two groups, corresponding to the large and small eigenvalues respectively. The borderline between the two sets is not sharp; it is determined by the form of the constraint, which is chosen for the sake of simplicity of the mathematical formulation.

To summarize this brief review of the main classical techniques for minimizing the least-squares function, we can point out the following. All these methods use the idea of splitting the possible search directions into two groups: the first contains the directions in which the search can be successful, the second the directions of unsuccessful searches. To realize such a division, some prior information is needed. It can take different forms, from a direct instruction to the very sophisticated algorithm for determining the search direction in the trust-region technique. All these methods have similar advantages and disadvantages. The advantage is relative simplicity and, as a result, high computational speed. This simplicity is achieved by using information available a priori. This information is not obtained from the least-squares problem itself and has to be guessed by the user; if it is incorrect, the method performs very poorly. Even in the trust-region case, the most developed of these algorithms, a simple rotation of the elliptical region by 90 degrees will make the algorithm stall for a long time. Using this prior information, which does not follow from the structure of the matrix, is the main disadvantage of the classical methods.
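A minimal sketch of the damped step that arises from the Lagrange-multiplier formulation above: (P + μI) x = A^T r, where increasing μ suppresses the contributions of the eigenvectors of P with small eigenvalues. The function name and interface are ours.

```python
import numpy as np

# Damped (Levenberg-Marquardt style) step: solve (P + mu*I) x = A^T r.
def damped_step(A, r, mu):
    P = A.T @ A
    return np.linalg.solve(P + mu * np.eye(P.shape[0]), A.T @ r)
```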

An alternative technique, which does not employ any information known a priori but instead determines the direction of the minimization by analyzing the structure of the matrix A, is the method of singular value decomposition (SVD) of the matrix A.

Singular value decomposition and LS orbit determination

Consider the linearized least-squares problem of minimizing ||A x - r||^2, where A is the matrix of partial derivatives and r is the vector of measurement residuals. For the SVD, introduce

A = U S V^T,        (1)

where U consists of the n orthonormal eigenvectors corresponding to the n largest eigenvalues of A A^T, and V is the matrix of orthonormal eigenvectors of the matrix A^T A. The diagonal elements s_j of S are the square roots of the non-negative eigenvalues of A^T A; they are called the singular values. Now introduce the vectors

z = V^T x,   g = U^T r;        (2)

then the least-squares problem reduces to

S z = g.        (3)

Since S is a diagonal matrix, the influence of each component of z can be observed immediately: introducing the component

z_j = g_j / s_j        (4)

into the solution reduces the square of the norm of the residual by g_j^2.

Probe solutions. Now let the singular values be in descending order, s_1 ≥ s_2 ≥ ... ≥ s_n, and consider the probe solution vectors

y_k = Σ_{j=1}^{k} (g_j / s_j) v_j,   k = 1, ..., n,        (5)

where v_j is the j-th column of V. The vector y_k is the normal pseudo-solution of the least-squares problem if we disregard the singular values s_{k+1}, ..., s_n and consider them equal to zero. The corresponding square of the norm of the residual is

||r - A y_k||^2 = ||r||^2 - Σ_{j=1}^{k} g_j^2.        (6)

Now assume that A is poorly conditioned, that is, some of the singular values are widely separated. The corresponding components g_j / s_j may be too large because of the small singular values. Thus, one needs to find an index k such that the norm of the probe vector and the norm of the residual for this probe solution are both small enough.
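The probe solutions y_k and the corresponding residual decreases can be computed directly from the SVD, as in the following sketch of equations (5) and (6); variable and function names are ours.

```python
import numpy as np

# Probe solutions for min ||A x - r||^2: keep the k largest singular values
# and treat the remaining ones as zero.
def svd_probe_solutions(A, r):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U S V^T, s descending
    g = U.T @ r                                       # projections of the residual
    y = np.zeros(A.shape[1])
    res2 = float(r @ r)
    probes, decreases = [], []
    for j in range(len(s)):
        y = y + (g[j] / s[j]) * Vt[j]   # add the j-th column of V scaled by g_j / s_j
        res2 -= g[j] ** 2               # each component removes g_j^2 from the squared residual
        probes.append(y.copy())
        decreases.append(float(r @ r) - res2)   # expected decrease of the error function
    return probes, decreases
```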

With the singular values in descending order, the procedure is as follows.

1. Develop the matrix of trial vectors

Y = [y_1, y_2, ..., y_n],   y_k = Σ_{j=1}^{k} (g_j / s_j) v_j,        (7)

where v_j is the j-th column of V.

2. For each trial vector y_k, compute the expected decrease of the least-squares error function using

ΔF_k = Σ_{j=1}^{k} g_j^2.        (8)

3. For k = 1, ..., n, check the acceptability of the trial vectors. To accomplish this, check each of their components and determine whether the following inequality is satisfied:

|y_k,j| ≤ d_j,   j = 1, ..., n,        (9)

that is, determine whether the change in each of the elements is less than some prescribed amount. If inequality (9) is not satisfied for some j, calculate the required step-reduction coefficient

β_j = d_j / |y_k,j|,        (10)

and, after checking all the components, set

β = min_j β_j.        (11)

If the inequality is satisfied for all components, go to the next step; if not, the trial vector is first normalized by multiplying it by β.

4. Now check the relative decrease of the least-squares function as we go to the next trial vector. If the relative-decrease inequality (12) is satisfied, then this trial vector is taken as the next iterate.

5. If the inequality in Step 4 is not satisfied, then the previous trial vector is used.

6. After computing the least-squares function at the accepted trial vector, determine whether the SVD method is converging sufficiently, i.e., whether the convergence inequality (13) is satisfied. If it is, go to the next step; if it is not, then the modified least-squares method is used, because the Hessian is degenerate and the residuals need to be taken into account.
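A sketch of the selection logic in Steps 3-5 above. The per-component bounds d and the relative-decrease threshold eps are stand-ins for the paper's inequalities (9)-(13), whose exact form is not reproduced here; only the control flow is illustrated, and all names are ours.

```python
import numpy as np

# Illustrative only: component limiting (Step 3) and trial-vector selection
# (Steps 4-5). "probes" and "decreases" are the outputs of svd_probe_solutions.
def select_trial_vector(probes, decreases, d, eps=0.1):
    chosen, prev_gain = None, None
    for y, gain in zip(probes, decreases):
        # Step 3: if any component exceeds its prescribed bound d_j, shrink the
        # whole vector by the smallest coefficient beta_j = d_j / |y_j|.
        beta = min(1.0, float(np.min(d / np.maximum(np.abs(y), 1e-300))))
        y = beta * y
        # Steps 4-5: keep moving to the next trial vector only while it offers
        # a sufficient extra decrease over the previous one (eps is assumed).
        if prev_gain is not None and gain - prev_gain < eps * abs(prev_gain):
            break
        chosen, prev_gain = y, gain
    return chosen
```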

Examination of the convergence area of an algorithm as a function of the error in the initial guess

Studying the convergence area of various algorithms as a function of the accuracy of the initial guess is very interesting from the practical standpoint, but it is also very difficult because of the high dimensionality of the space of initial-guess parameters and because of additional parameters (tracking time, accuracy of the measurements, etc.). This is why such studies are usually carried out by testing a certain set of representative problems with typical difficulties, such as the presence of ravines, bad scaling of the variables, etc. In order to have an exact solution for benchmarking, these test problems are usually polynomial and have a small number of variables. Real problems are much more difficult, which is why conclusions drawn from such test problems are not always confirmed. We prefer a different path: to choose a rather complex problem of determining an orbit from a set of measurements, to study the convergence area in this problem thoroughly, and then to try to understand what defines this convergence area.

It is known that the most difficult case in determining orbits with the least-squares technique is when the measurements are optical. In this case, instead of all 6 components of the phase vector, only two angular components are measured, and they are related to the phase vector in a very nonlinear way. Additional difficulties arise in calculating the residual: the angular residual is determined only to within plus or minus one turn, so even if the coordinate residual is thousands or tens of thousands of kilometers, the angular residual can be small and never exceeds π.

As far as the tracking time is concerned, two classes can be named: one night, and long periods of following the object. If the object is followed for one night and the number of measurement points is small, the main problem is degeneracy, and thus the accuracy of determining an orbit close to a degenerate one [3]; the nonlinearity is not as important in this case. For example, it is shown in [3] that using the SVD technique gives a significant advantage in orbital prediction accuracy when the number of measurements is small; several typical examples illustrating the speed of convergence are also given there.

It is especially difficult to obtain good convergence when one has to deal with a series of measurements made over a long period of time but a good initial guess is lacking. In the case of the Russian space control center, there are two classes of orbits which are followed over long periods of time with optical measurements: quasi-stationary orbits with a period close to 24 hours, and highly elliptical half-day orbits of the Molniya ("Lightning") type. The center has an algorithm for finding an initial guess from three pairs of angular measurements. It is based on Battin's effective technique for orbit determination from two positions. However, if large or anomalous errors are present in the measurements, it can be difficult to find a good trio of measurement data. If there is a gap in observations, the orbital arcs are determined separately and then pasted together.
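As a small aside on the ±π wrapping of the angular residual mentioned above, a one-line sketch (name ours):

```python
import numpy as np

# Wrap an angular residual into [-pi, pi), so its magnitude never exceeds pi
# no matter how large the underlying coordinate error is.
def wrap_angle(residual_rad):
    return (residual_rad + np.pi) % (2.0 * np.pi) - np.pi
```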

Finally, the accuracy of the technique depends on how sensitive it is to the initial guess. This is why work was carried out to determine orbits over long time intervals with poor initial guesses. For the stationary orbits there is a natural initial guess: the semi-major axis corresponds to the 24-hour orbit, and all the other parameters are set to zero. The time interval is determined from the condition that the residuals be single-valued: depending on the error in the predicted number of revolutions, there are many local minima in the semi-major axis. For a given error in the initial guess of the semi-major axis, the time interval was therefore chosen in such a way that, with this error, the ambiguity problem is absent. Experiment has shown that convergence over a 10-day interval can be achieved with the natural initial guess.

The most difficult case is that of the half-day highly elliptical satellites. Results of experiments with this class of orbits are considered in the next section, which is devoted to the more difficult task of determining the orbit of a highly elliptical half-day object using optical measurements over an 8-week interval. One can hope that it is in such difficult problems that the SVD technique will be demonstrated to be superior to the classical methods.

Study using a highly elliptical half-day object of the Molniya ("Lightning") type

We have chosen an example in which dense tracking of the object was carried out for 8 weeks by one station. Such long tracking is an exception in our conditions; usually, measurements are made over short intervals with big gaps between them. Such tracks are not desirable for studying convergence, since they make it impossible to observe a continuous picture of how the residuals change: usually everything is fine until a gap is reached, everything is bad after the gap, and it is not clear why. Let us consider the results of the calculations, with brief comments, as an example. Presented in Table 1 are the initial values for the example in the Lagrangian elements (the first line) and the Kepler elements (the second line).

Table 1

λ         L        p          q           h         k
1.54458   102899   -0.518194  -0.0885024  0.620487  -0.253247

u (deg)   Ω (deg)  ω (deg)    a (km)      e         i (deg)
88.49     170.308  282.51     26563.4     0.670178  63.4304

Given in Table 2 is an example of convergence in the case when a good initial guess is available. The first column gives the iteration number and the second the value of the function F. The other columns contain the elements of the orbit obtained at each iteration, in two versions: the top line holds the Lagrangian elements, the bottom line the Kepler elements. The two title lines contain the usual letter notations for the elements. The following tables have a similar structure.

Table 2

Iter  F          λ        L        p          q           h         k
                 u (deg)  Ω (deg)  ω (deg)    a (km)      e         i (deg)
0     1.10e+08   1.54614  102898   -0.518196  -0.0885043  0.678499  -0.273246
                 88.49    170.256  282.745    26564.1     0.668579  63.4264
1     2.75e+06   1.54474  102899   -0.518161  -0.0885209  0.620503  -0.253281
                 88.49    170.305  282.51     26563.3     0.670205  63.4264
2     637.825    1.54458  102899   -0.518194  -0.0885024  0.620487  -0.253247
                 88.49    170.308  282.51     26563.4     0.670178  63.4304
3     99.1189    1.54458  102899   -0.518194  -0.0885024  0.620487  -0.253247
                 88.49    170.308  282.51     26563.4     0.670178  63.4304
4     99.1187    1.54458  102899   -0.518194  -0.0885024  0.620487  -0.253247
                 88.49    170.308  282.51     26563.4     0.670178  63.4304

As can be seen from the table, only the h, k elements undergo small changes, yet these changes produce a change of six orders of magnitude in the function. Table 3 contains the results of solving a more difficult example problem: here the elements defining the orientation of the orbital plane in the initial guess are turned by 140 degrees! As can be seen from the table, the minimization proceeded in a peculiar way, with practically all the elements changing. This example demonstrates two peculiarities of the highly elliptical object which are not present for the stationary ones.

Table 3

Iter  F          λ         L        p           q           h          k
                 u (deg)   Ω (deg)  ω (deg)     a (km)      e          i (deg)
0     5.980e+10  1.54447   102899   0.517825    0.291791    0.620471   -0.253246
                 88.4917   29.401   82.8019     26563.4     0.670162   72.9366
1     4.590e+10  0.562329  102899   0.2818150   0.400643    -0.01049   -0.385884
                 32.2191   54.8773  -233.319    26563.4     0.386027   58.659
2     2.940e+10  1.14273   102923   -0.113143   0.190669    -0.32036   0.0681839
                           120.685  -198.67     26576       0.327542   25.6192
3     8.426e+09  1.15088   102925   -0.375606   -0.409918   0.0121758  -0.130572
                           132.499  307.171     26576.8     0.131139   67.5564
4     7.148e+08  1.08891   102920   -0.42987    -0.216793   0.289663   -0.342379
                           153.237  293.005     26574.1     0.448473   57.5593
5     9.030e+07  1.54573   102899   -0.518105   -0.0894337  0.61991    -0.254829
                           170.206  282.553     26563.6     0.670243   63.4398
6     546237     1.54459   102899   -0.518189   -0.0885073  0.620487   -0.253249
                           170.307  282.51      26563.4     0.670179   63.4299
7     98.5423    1.54458   102899   -0.518194   -0.0885024  0.620488   -0.253247
                           170.308  282.51      26563.4     0.670178   63.4304

At the first iteration, the calculated step led to an unacceptable point: an attempt to compute the Kepler orbital elements there would cause a crash. This is why special restrictions on the parameter values at each step were introduced into the algorithm: if the parameter values turn out to be unacceptable, the step size is divided by 5. A similar situation arose with the perigee height at the second iteration, where the object went underground; once again the guard worked, and the step was divided by two. All the following steps proceeded without such scaling, and the convergence became quite fast at the last iterations.

Let us now analyze the minimization process with the SVD technique. Given in Table 4 are the first-iteration calculations used to build the test vectors y_k: the columns labeled 1 through 6 contain the individual contributions (g_j / s_j) v_j, and the symbol x denotes the Gauss-Newton step vector obtained from equation (7), i.e., their sum.
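A sketch of the step-size guard described above. The specific acceptability test used here (eccentricity below 1 and perigee above the Earth's surface) and all names are our assumptions for illustration; the paper only states that an unacceptable trial point causes the step to be reduced and retried.

```python
R_EARTH_KM = 6378.0  # assumed Earth radius used for the perigee check

def guarded_step(x, dx, kepler_a_e, shrink=5.0, max_tries=10):
    """x, dx: current parameter vector and proposed step (e.g. NumPy arrays);
    kepler_a_e: user-supplied function returning (a, e) for a parameter vector."""
    for _ in range(max_tries):
        trial = x + dx
        a, e = kepler_a_e(trial)
        if 0.0 <= e < 1.0 and a * (1.0 - e) > R_EARTH_KM:  # perigee above the surface
            return trial                                    # acceptable point
        dx = dx / shrink                                    # divide the step and retry
    return x                                                # give up: stay at the current point
```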

Considering this vector, we can see that the value of the parameter p is such that the step does not satisfy the acceptability condition. This is why the first iteration step in the Gauss-Newton case is divided by 5.

Let us now consider the process of solving the same problem with the SVD method. We calculate the sequence of test vectors using equation (7) and the data in Table 4; the resulting first-iteration vectors are listed in Table 5. It is easy to see that the test vectors numbered 4 and above do not satisfy the natural conditions on the elements h, k. Therefore, let us take the test vector of dimensionality 3. Results of the calculations with this restriction are given in Table 6. Comparison of the first three rows of that table with the corresponding rows of Table 3 demonstrates a faster convergence.

Table 4

      1          2           3          4          5         6         x
λ     0.001492   -0.006011   0.192775   0.222848   -3.88298  -1.4281   -4.8997
L     5.081e-05  2.9421e-05  -0.00112   0.003652   0.034711  -187.559  -187.52
p     0.01538    -0.14817    -0.02940   -0.489543  -0.24947  -0.25649  -1.1577
q     -0.0180    -0.09886    -0.25619   0.643738   -0.04789  0.31518   0.5378
h     -0.0086    0.04078     -0.46120   -0.574383  -1.47969  -0.6764   -3.1596
k     -0.027     -0.0312     0.310331   -0.507001  0.14993   -0.5647   -0.6708

Table 5

      1         2         3        4        5         6
λ     0.0015    -0.0045   0.1883   0.4111   -3.4719   -4.9
L     0.0001    0.0001    -0.001   0.0026   0.0373    -187.5217
p     0.0154    -0.1328   -0.1622  -0.6517  -0.9012   -1.1577
q     -0.0181   -0.117    -0.3731  0.2706   0.2227    0.5379
h     -0.0087   0.0321    -0.4291  -1.0035  -2.4832   -3.1596
k     -0.0273   -0.0586   0.2517   -0.2553  -0.1054   -0.6701

Table 6

Iter  F          λ        L       p          q           h         k
0     5.980e+10  1.54447  102899  0.517825   0.291791    0.620471  -0.253246
1     3.060e+10  1.73273  102899  0.355634   -0.0813544  0.191358  -0.00153815
2     6.495e+09  1.75762  102899  -0.254545  -0.504166   0.15468   -0.430541
3     3.414e+09  1.75225  102899  -0.186484  -0.375743   0.178891  -0.395942

Conclusion

This work is a continuation of the one described in [3]. Together, the two articles consider the main types of orbits and demonstrate the possibility of using an SVD-based technique for the minimization. The general conclusion of this research can be formulated as follows: the more complex the minimization problem, the greater the advantage in robustness provided by the SVD technique. There are real application cases in which convergence cannot be obtained with the conventional methods for a certain subset of dimensions. In such cases it would be very beneficial to have an experimentally based estimate of the guaranteed convergence area of the minimization algorithm. This is why it makes sense to continue studying algorithms based on the SVD method for the difficult case of highly elliptical orbits, and to attempt to obtain at least a rough estimate of the parametric set of convergence.

References

1. P. E. Gill, W. Murray, and M. H. Wright, Practical Optimization, Academic Press, 1981.
2. A. R. Conn, N. I. M. Gould, and Ph. L. Toint, Trust-Region Methods, No. 1 in the MPS-SIAM Series on Optimization, SIAM, Philadelphia, USA, 2000.
3. V. F. Boykov, Z. N. Khutorovskiy, and K. T. Alfriend, "Singular value decomposition and least squares orbit determination," Proceedings of the 7th US/Russian Space Surveillance Workshop, 2007.