MATH444.0 ASSIGNMENT 03 SOLUTIONS

4.3 Newton's method can be used to compute reciprocals, without division. To compute $1/R$, let $f(x) = 1/x - R$ so that $f(x) = 0$ when $x = 1/R$. Write down the Newton iteration for this problem and compute (by hand or with a calculator) the first few Newton iterates for approximating $1/3$ starting with $x_0 = 0.5$ and not using any division. What happens if you start with $x_0 = 1$? For positive $R$, use the theory of fixed point iteration to determine an interval about $1/R$ from which Newton's method will converge to $1/R$.

Solution. Given $f(x) = 1/x - R$, we have $f'(x) = -1/x^2$. The Newton iteration for this problem is then
\[
x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)} = x_k - \frac{1/x_k - R}{-1/x_k^2} = 2x_k - R x_k^2 .
\]
To approximate $1/3$, we use the above with $R = 3$. With $x_0 = 0.5$ we get
\begin{align*}
x_1 &= 2(0.5) - 3(0.5)^2 = 0.25 \\
x_2 &= 2(0.25) - 3(0.25)^2 = 0.3125 \\
x_3 &= 2(0.3125) - 3(0.3125)^2 = 0.33203125 \\
x_4 &= 2(0.33203125) - 3(0.33203125)^2 = 0.3333282470703125
\end{align*}
which demonstrates rapid convergence to $0.\overline{3}$. On the other hand, with $x_0 = 1$, we see
\begin{align*}
x_1 &= 2(1) - 3(1)^2 = -1 \\
x_2 &= 2(-1) - 3(-1)^2 = -5 \\
x_3 &= 2(-5) - 3(-5)^2 = -85 \\
x_4 &= 2(-85) - 3(-85)^2 = -21845
\end{align*}
which demonstrates rapid divergence.

Considering Newton's method as a fixed-point iteration, we see that it will converge for $x_0$ chosen in an interval $[a, b]$ which the function $g(x) = 2x - Rx^2$ maps into itself and on which $g'$ exists and $|g'|$ is strictly bounded above by 1. Since $g$ is a polynomial and therefore differentiable, $g'(x) = 2 - 2Rx$, and so
\[
|g'(x)| < 1 \iff \frac{1}{2R} < x < \frac{3}{2R} .
\]
All that remains is to show that $g$ maps this interval into itself. To do this, maximize and minimize $g$ on the interval $[1/(2R), 3/(2R)]$. The maximum occurs at $x = 1/R$ and is $1/R$, while the minimum occurs at either endpoint and takes the value $3/(4R)$. Thus $g$ maps the interval into $[3/(4R), 1/R]$, which lies inside it, and therefore any choice of initial guess $x_0$ in $[1/(2R), 3/(2R)]$ will give convergence. (At the two endpoints $|g'| = 1$, but a single step moves the iterate to $3/(4R)$; on $[3/(4R), 1/R]$, which $g$ also maps into itself, $|g'(x)| \le 1/2$.)
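The hand computation above can be checked with a short script. This is only an illustrative sketch in Python (the assignment itself uses Matlab); the function name `newton_reciprocal` is a hypothetical helper introduced here, not part of the original solution.

```python
def newton_reciprocal(R, x0, steps):
    """Return [x_0, ..., x_steps] for the iteration x_{k+1} = 2*x_k - R*x_k**2.

    Hypothetical helper for illustration: only multiplication and
    subtraction are used, so no division is performed.
    """
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(2 * x - R * x * x)
    return xs


# R = 3 with the two starting guesses from the problem statement.
converging = newton_reciprocal(3, 0.5, 4)
diverging = newton_reciprocal(3, 1, 4)
print(converging)
print(diverging)

# A starting guess at the left endpoint of [1/(2R), 3/(2R)].
from_endpoint = newton_reciprocal(3, 1 / 6, 10)
print(from_endpoint[-1])
```

Every iterate in the first run is a dyadic rational, so double precision reproduces the hand values exactly; the second run uses integer arithmetic and reproduces the diverging sequence.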
4.2 Let $\phi(x) = (x^2 + 4)/5$.

a) Find the fixed point(s) of $\phi(x)$.

Solution. The fixed points of $\phi$ satisfy $x = \frac{x^2 + 4}{5}$ and are the roots of $f(x) = x^2 - 5x + 4 = (x - 1)(x - 4)$. The fixed points of $\phi$ are $x = 1, 4$.

b) Would the fixed point iteration $x_{k+1} = \phi(x_k)$ converge to a fixed point in the interval $[0, 2]$ for all initial guesses $x_0 \in [0, 2]$?

Solution. For fixed-point iteration to converge for all initial guesses $x_0 \in [0, 2]$, it suffices that $\phi$ map $[0, 2]$ into itself and that $|\phi'(x)|$ be bounded by a constant less than 1 for all $x \in [0, 2]$. Note that $\phi'(x) = \frac{2x}{5}$. On $[0, 2]$, $0 \le \phi'(x) \le \frac{4}{5}$. Also, the minimum value of $\phi(x)$ on this interval is $\frac{4}{5} \ge 0$ and the maximum value is $\frac{8}{5} \le 2$. Thus $\phi$ satisfies the hypotheses of the fixed-point theorem, and so fixed-point iteration converges to $x = 1$ for all initial guesses in $[0, 2]$.

4.6 Steffensen's method for solving $f(x) = 0$ is defined by
\[
x_{k+1} = x_k - \frac{f(x_k)}{g_k}, \qquad \text{where} \qquad g_k = \frac{f(x_k + f(x_k)) - f(x_k)}{f(x_k)} .
\]
Show that this is quadratically convergent, under suitable hypotheses.

Solution. Let
\[
g(x) = \frac{f(x + f(x)) - f(x)}{f(x)} .
\]
Supposing that $x^*$ is a simple root of $f$ (that is, $f'(x^*) \ne 0$), then $g(x)$, while not defined at $x^*$, satisfies, by L'Hôpital's rule,
\[
\lim_{x \to x^*} g(x) = \lim_{x \to x^*} \frac{f(x + f(x)) - f(x)}{f(x)} = \lim_{x \to x^*} \frac{f'(x + f(x))\,(1 + f'(x)) - f'(x)}{f'(x)} = f'(x^*) .
\]
Similarly, $g$ is differentiable in a neighborhood of $x^*$, again presuming that $f$ is sufficiently smooth, and so we can argue that given $\varphi(x) = x - \frac{f(x)}{g(x)}$, with $\varphi(x^*) = x^*$,
\[
\lim_{x \to x^*} \frac{\varphi(x) - \varphi(x^*)}{x - x^*} = \lim_{x \to x^*} \left( 1 - \frac{f(x)}{x - x^*} \cdot \frac{1}{g(x)} \right) = 1 - \frac{f'(x^*)}{f'(x^*)} = 0 .
\]
That is, $\varphi'(x^*) = 0$, and therefore Steffensen's method converges quadratically to the root $x^*$.

5.2 Write down the IEEE double-precision representation for the decimal number 50.2, using round to nearest.
Solution. Since 50.2 is positive, the sign bit here is $s = 0$. Also, $2^5 = 32 \le 50.2 < 2^6 = 64$, so the exponent is 5. Adding the bias 1023, the exponent bits are the bits that represent $1028 = 10000000100_2 = \mathtt{0x404}$, the last number being the hexadecimal representation.

All that remains is to determine the mantissa. To do so, we need to note that $50 = 32 + 16 + 2 = 110010_2$. Also, we need to determine the binary representation of 0.2. Note that
\[
0.2 = \frac{1}{5} = \frac{1}{8} + \frac{3}{40}, \qquad \text{where} \qquad \frac{3}{40} = \frac{1}{16} + \frac{1}{80} .
\]
So far, then, we have $\frac{1}{5} = \frac{1}{8} + \frac{1}{16} + \frac{1}{80}$, so
\[
\frac{1}{80} = \frac{1}{16} \cdot \frac{1}{5} = \frac{1}{16}\left( \frac{1}{8} + \frac{1}{16} + \frac{1}{80} \right) = 2^{-7} + 2^{-8} + 2^{-8} \cdot \frac{1}{5}
\]
and
\[
0.2 = 2^{-3} + 2^{-4} + 2^{-7} + 2^{-8} + 2^{-8}(0.2) .
\]
Continuing, it is easy to see that
\begin{align*}
0.2 &= 2^{-3} + 2^{-4} + 2^{-7} + 2^{-8} + 2^{-11} + 2^{-12} + 2^{-12}(0.2) \\
    &= 2^{-3} + 2^{-4} + 2^{-7} + \cdots + 2^{-44} + 2^{-47} + 2^{-48} + 2^{-48}(0.2) .
\end{align*}
And so
\begin{align*}
50.2 &= 2^5 + 2^4 + 2^1 + 2^{-3} + 2^{-4} + 2^{-7} + \cdots + 2^{-44} + 2^{-47} + 2^{-48} + 2^{-48}(0.2) \\
     &= 2^5 \left( 1 + 2^{-1} + 2^{-4} + 2^{-8} + 2^{-9} + 2^{-12} + \cdots + 2^{-49} + 2^{-52} + 2^{-53} + 2^{-53}(0.2) \right) .
\end{align*}
Represented in binary, then, we have
\[
50.2 = 2^5 \left( 1.10010001100110011 \cdots 1001_2 + 2^{-53} + 2^{-53}(0.2) \right) .
\]
Rounding to nearest causes the last four bits to change from $1001_2$ to $1010_2$ because the bit after the fifty-second bit is 1 (that is, $2^{-53}$ appears there). Putting all of the above together, we see that the double-precision representation of 50.2 in round to nearest is

    0 | 10000000100 | 1001000110011001100110011001100110011001100110011010

where the bars show the separation between the sign bit, the exponent and the mantissa. In hexadecimal this is written as 0x404919999999999a.

5.3 What is the gap between 2 and the next larger double-precision number?
Solution. To determine the gap between 2 and the next larger number, determine the exponent $e$ for 2, and multiply $2^e$ by $\varepsilon = 2^{-52}$. Here $2^1 \le 2 < 2^2$, therefore the gap between 2 and the next larger double-precision number is $2^{-52+1} = 2^{-51}$. This can be verified using Matlab:

    log2(eps(2))
    -51

5.4 What is the gap between 201 and the next larger double-precision number?

Solution. To determine the gap between 201 and the next larger number, determine the exponent $e$ for 201, and multiply $2^e$ by $\varepsilon = 2^{-52}$. Here $2^7 = 128 \le 201 < 2^8$, therefore the gap between 201 and the next larger double-precision number is $2^{-52+7} = 2^{-45}$. This can be verified using Matlab:

    log2(eps(201))
    -45

3) a) Show that
\[
\ln\left( x - \sqrt{x^2 - 1} \right) = -\ln\left( x + \sqrt{x^2 - 1} \right) .
\]

Solution. Note that
\[
\ln\left( x - \sqrt{x^2 - 1} \right) + \ln\left( x + \sqrt{x^2 - 1} \right) = \ln\left( \left( x - \sqrt{x^2 - 1} \right)\left( x + \sqrt{x^2 - 1} \right) \right) = \ln\left( x^2 - (x^2 - 1) \right) = \ln(1) = 0 .
\]
Therefore
\[
\ln\left( x - \sqrt{x^2 - 1} \right) = -\ln\left( x + \sqrt{x^2 - 1} \right) .
\]

b) Which of the two formulas is more suitable for numerical computation? Explain why and provide a numerical example in which the difference in accuracy is evident.

Solution. The second formula, that is $-\ln\left( x + \sqrt{x^2 - 1} \right)$, is more suitable for numerical computing because it doesn't suffer the cancellation that the first formula does. The following Matlab script demonstrates that the first formula is less accurate in the worst case than the second formula:

    f = @(x) log(x-sqrt(x.^2-1));   % First formula
    g = @(x) -log(x+sqrt(x.^2-1));  % Second formula
    % Note that with x = sqrt(y^2 + 1), then exp(f(x)) == (x-y) should
    % be the case. We can use this to evaluate the accuracy in general
    % of the two formulae. Let's look at
    err = @(f, x, y) (abs(exp(f(x))-(x-y))./abs(x))./eps;
    % which gives the relative error in terms of unit round-off.
    y = 0:10000;
    x = sqrt(y.^2+1);
    figure;
    plot(x, err(f, x, y));
    hold on;
    plot(x, err(g, x, y));
    set(gca, 'XLim', [0 ceil(max(x))]);
    legend('Error in First formula', 'Error in Second formula');

The generated plot looks like this:

[Figure: error, in units of eps, of the first and second formulas plotted against x.]

The other trouble is that the first formula results in $-\infty$ for all double precision numbers larger than about $\sqrt{2^{53}}$, since $x^2 - 1$ then rounds to $x^2$ and the argument of the logarithm becomes zero, while the second formula delays overflow until after $\sqrt{\mathtt{realmax}} \approx 2^{512}$. Here is an example, in which both checks return true:

    >> x = 2.^(27:511);
    >> all(isinf(f(x)))
    >> all(isfinite(g(x)))
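The same comparison can be sketched outside Matlab. The Python snippet below is an illustration only: it assumes `math.acosh` is an accurate reference (both formulas equal $-\operatorname{arccosh}(x)$ in exact arithmetic), and the names `first_formula`, `second_formula`, `err_first` and `err_second` are hypothetical, introduced for this example.

```python
import math


def first_formula(x):
    # ln(x - sqrt(x^2 - 1)): subtracts two nearly equal numbers when x is large.
    return math.log(x - math.sqrt(x * x - 1.0))


def second_formula(x):
    # -ln(x + sqrt(x^2 - 1)): only adds positive quantities, so no cancellation.
    return -math.log(x + math.sqrt(x * x - 1.0))


x = 1.0e7
reference = -math.acosh(x)  # assumption: acosh serves as the accurate reference
err_first = abs(first_formula(x) - reference)
err_second = abs(second_formula(x) - reference)
print("absolute error, first formula: ", err_first)
print("absolute error, second formula:", err_second)

# For x = 2^27, x^2 - 1 rounds to x^2, so the log argument of the first
# formula collapses to exactly zero, while the second formula stays finite.
big = 2.0 ** 27
collapsed = big - math.sqrt(big * big - 1.0)
print(collapsed == 0.0, math.isfinite(second_formula(big)))
```

Note that `math.log(0.0)` raises a `ValueError` in Python rather than returning `-Inf` as Matlab's `log` does, which is why the sketch inspects the argument of the logarithm instead of calling `first_formula(big)`.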
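4) For the following expressions, state the numerical difficulties that may occur, and rewrite the formulas in a way that is more suitable for numerical computation:

a) $\sqrt{x + \frac{1}{x}} - \sqrt{x - \frac{1}{x}}$, where $x \gg 1$.

Solution. When $x$ is much larger than 1, $\frac{1}{x} \approx 0$ and so $\sqrt{x + \frac{1}{x}} \approx \sqrt{x} \approx \sqrt{x - \frac{1}{x}}$, demonstrating catastrophic cancellation. This should be rewritten as
\begin{align*}
\sqrt{x + \tfrac{1}{x}} - \sqrt{x - \tfrac{1}{x}}
  &= \left( \sqrt{x + \tfrac{1}{x}} - \sqrt{x - \tfrac{1}{x}} \right) \frac{\sqrt{x + \tfrac{1}{x}} + \sqrt{x - \tfrac{1}{x}}}{\sqrt{x + \tfrac{1}{x}} + \sqrt{x - \tfrac{1}{x}}} \\
  &= \frac{\left( x + \tfrac{1}{x} \right) - \left( x - \tfrac{1}{x} \right)}{\sqrt{x + \tfrac{1}{x}} + \sqrt{x - \tfrac{1}{x}}} \\
  &= \frac{2}{x \left( \sqrt{x + \tfrac{1}{x}} + \sqrt{x - \tfrac{1}{x}} \right)} .
\end{align*}
This formulation no longer suffers cancellation when $x \gg 1$ and will underflow more gradually as $x \to \infty$.

b) $\sqrt{\frac{1}{a^2} + \frac{1}{b^2}}$, where $a \approx 0$ and $b \approx 1$.

Solution. For $a \approx 0$ the intermediate value $\frac{1}{a^2}$ will overflow whenever $a < \frac{1}{\sqrt{r_{\max}}}$, where $r_{\max}$ = realmax, the largest finite floating point number in a given floating-point system. As a result, getting $a^2$ out of the denominator is desired. One possible way to do this is
\[
\sqrt{\frac{1}{a^2} + \frac{1}{b^2}} = \sqrt{\frac{1}{a^2}\left( 1 + \frac{a^2}{b^2} \right)} = \frac{1}{a} \sqrt{1 + \left( \frac{a}{b} \right)^2} .
\]
In the case that $b < 1$ the above should be preferred, and it won't suffer overflow until $a$ becomes subnormal. If $b > 1$, $\frac{\sqrt{b^2 + a^2}}{ab}$ might be slightly preferred since that will delay overflow.

c) $a^2 + b^2 - 2ab \sin(\theta)$, where $a \approx b$ and $\theta \approx \frac{\pi}{2}$.

Solution. When $a \approx b$ and $\theta \approx \frac{\pi}{2}$, $a^2 + b^2 - 2ab \sin(\theta) \approx 0$, and in floating point the computed value may in fact turn out to be negative, which will never occur in exact arithmetic. For example,

    >> a = 0.3; b = a-eps(a)/2; theta = pi/2;
    >> a.^2 + b.^2-2*a.*b.*sin(theta)
    -2.7756e-17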
To maintain the property that this expression is never negative, rewrite it as
\begin{align*}
a^2 + b^2 - 2ab \sin(\theta)
  &= a^2 \left( \sin^2(\theta) + \cos^2(\theta) \right) + b^2 - 2ab \sin(\theta) \\
  &= a^2 \sin^2(\theta) - 2ab \sin(\theta) + b^2 + a^2 \cos^2(\theta) \\
  &= \left( a \sin(\theta) - b \right)^2 + \left( a \cos(\theta) \right)^2 .
\end{align*}
This is the sum of non-negative numbers and it will always be non-negative.

5) Consider the linear system
\[
\begin{pmatrix} a & b \\ b & a \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \end{pmatrix}
\]
with $a, b > 0$; $a \ne b$.

a) If $a \approx b$, what is the numerical difficulty in solving this linear system?

Solution. It is easy to see that the correct solution to this linear system is
\[
x = \frac{a}{a^2 - b^2}, \qquad y = \frac{-b}{a^2 - b^2} .
\]
When $a \approx b$ the denominator suffers cancellation and becomes very small, resulting in inaccuracies when computing $x, y$. $a \approx b$ results in a nearly singular system, and in fact when $a = b$ the system is singular and there is no solution.

b) Suggest a numerically stable formula for computing $z = x + y$ given $a$ and $b$.

Solution. Adding the two equations gives $(a + b)x + (a + b)y = 1$, so the sum $z = x + y$ can be computed simply as $z = \frac{1}{a + b}$. As long as $a, b$ are both positive and not subnormal, $z$ will be easily computed as there is no cancellation.

c) Determine whether the following statement is true or false, and explain why: When $a \approx b$, the problem of solving the linear system is ill-conditioned but the problem of computing $x + y$ is not ill-conditioned.

Solution. This statement is true. When $a \approx b$, it is hard to compute $x, y$ accurately because of cancellation in forming $a^2 - b^2$, and the closer $a$ is to $b$, the worse it gets (the matrix has eigenvalues $a + b$ and $a - b$, so its 2-norm condition number is $(a + b)/|a - b|$). As demonstrated in part b), however, when both $a, b$ are positive, computing $x + y$ can easily be done with good relative accuracy, regardless of the values of $a, b$, provided they are not too small.
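The contrast in parts b) and c) of problem 5 can be illustrated numerically. The following Python sketch is not part of the original solution: the test values $a = 1 + 2^{-30}$, $b = 1$ and the helper `rel_error` are assumptions chosen for illustration, and `fractions.Fraction` supplies an exact reference for $1/(a+b)$.

```python
from fractions import Fraction

# Hypothetical test values with a close to b.
a = 1.0 + 2.0 ** -30
b = 1.0

# Naive route: solve for x and y separately, then add them.
denom = a * a - b * b  # cancellation: the 2^-60 term of a^2 is rounded away
x = a / denom
y = -b / denom
z_naive = x + y

# Stable route from part b): z = 1 / (a + b), with no subtraction at all.
z_stable = 1.0 / (a + b)

# Exact value of 1 / (a + b) for these particular doubles.
z_exact = 1 / (Fraction(a) + Fraction(b))


def rel_error(approx):
    """Relative error of a computed double against the exact rational value."""
    return float(abs(Fraction(approx) - z_exact) / z_exact)


print("relative error, x + y from the solved system:", rel_error(z_naive))
print("relative error, 1 / (a + b):                 ", rel_error(z_stable))
```

With these values the error committed in forming $a^2 - b^2$ carries through to $x + y$, whereas $1/(a+b)$ is accurate to about unit round-off.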