Answers to Selected Exercises in Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control by J. C. Spall


This section provides answers to selected exercises in the chapters and appendices. An asterisk (*) indicates that the solution here is incomplete relative to the information requested in the exercise. Unless indicated otherwise, the answers here are identical to those in the text on pp. 550-557 (it is possible that future versions of this site may include additional information, including solutions to supplementary exercises that are posted on the book's Web site).

CHAPTERS 1-17

1.2. On the domain [-1, 1], the maximum occurs at θ = 1/5 and the minimum occurs at θ = -1.

1.12.* The table below provides the solution for the first two (of six) iterations:

k    θ̂_k^T            L(θ̂_k)
0    [0, 3.0]         52.0
1    [2.78, 1.51]     .369
2    [2.5, 1.28]      .0643

(Note: This problem corresponds to Example 8.6.2 in Bazaraa et al., 1993. The numbers here differ slightly from those in Bazaraa et al.; this is apparently due to the limited accuracy used in Bazaraa et al.)

2.3.* A first-order Taylor approximation (Appendix A) yields

log(1 - P) ≈ log(1) - [1/(1 - P)]|_{P=0} P = -P.

This implies from (2.4) that the number of loss measurements n satisfies

n = log ρ / log(1 - P) ≈ -log ρ / P

for small P, as was to be shown.

3.4. (b) With a = .7, the sample means of the terminal estimates for β and γ are .91 and .5, respectively (versus true values of .9 and .5).
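As an illustration of the iterations in Exercise 1.12, here is a minimal steepest-descent sketch in Python. The loss L(t1, t2) = (t1 - 2)^4 + (t1 - 2 t2)^2 and the starting point [0, 3.0]^T are assumptions patterned on the cited Bazaraa et al. example (they reproduce the initial loss of 52.0 in the table above), and the bounded line search is illustrative rather than the book's exact procedure.

    import numpy as np
    from scipy.optimize import minimize_scalar

    # Assumed Bazaraa et al.-style loss: L(t1, t2) = (t1 - 2)^4 + (t1 - 2*t2)^2
    def loss(theta):
        t1, t2 = theta
        return (t1 - 2.0) ** 4 + (t1 - 2.0 * t2) ** 2

    def grad(theta):
        t1, t2 = theta
        g1 = 4.0 * (t1 - 2.0) ** 3 + 2.0 * (t1 - 2.0 * t2)
        g2 = -4.0 * (t1 - 2.0 * t2)
        return np.array([g1, g2])

    theta = np.array([0.0, 3.0])   # assumed starting point; loss(theta) = 52.0
    for k in range(6):             # six iterations, as in the exercise
        d = -grad(theta)           # steepest-descent direction
        # one-dimensional line search for the step size (illustrative choice)
        step = minimize_scalar(lambda lam: loss(theta + lam * d),
                               bounds=(0.0, 1.0), method="bounded").x
        theta = theta + step * d
        print(k + 1, theta, loss(theta))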

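The approximation in Exercise 2.3 is also easy to check numerically: for small P, the exact sample size log ρ / log(1 - P) is close to -log ρ / P. The particular ρ and P below are arbitrary illustrative values.

    import numpy as np

    rho, P = 0.05, 0.01                      # illustrative values only
    exact = np.log(rho) / np.log(1.0 - P)    # n from (2.4)
    approx = -np.log(rho) / P                # first-order Taylor approximation
    print(exact, approx)                     # about 298.1 versus 299.6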
4.3.* The table below shows the sample means of the terminal estimate from the replications. Under each sample mean is the approximate 95 percent confidence interval computed using the principles in Section B.1 (Appendix B).

                                   n = 100       n = 10,000      n = 1,000,000
Sample mean of terminal
estimate (and 95% interval)        .79           .92             .974
                                   [.7, .76]     [.906, .934]    [.970, .977]

5.1.* The solution applies the recursive processing with one pass through the data for one value of a; the resulting terminal estimates β̂ and B̂ are as given in the text's solution. The mean absolute deviation (MAD), computed with respect to the test data in reeddata-test, is MAD = .3647.

6.4. For convenience, let X_k = k^{β/2}(θ̂_k - θ*). It is assumed that X_k converges in distribution to N(µ_FD, Σ_FD). From Figure C.1 in Appendix C, pr. and a.s. convergence are both stronger than dist. convergence. However, even pr. and a.s. convergence are, in general, not strong enough to guarantee that the mean of the process converges (see Example C.4). Therefore, E(X_k) does not converge to µ_FD in general.

7.3. With valid (Bernoulli) perturbations, the mean terminal loss value L(θ̂) is approximately .3 over several independent replications. For the (invalid) uniform [-3, 3] and N(0, 1) perturbation distributions, the θ estimate quickly diverges. Based on several runs, the terminal loss value was typically between 10^5 and 10^10 for the uniform case and much greater than 10^10 for the normal case. This illustrates the danger of using an invalid perturbation distribution, even one that has the same mean and variance as a valid distribution.

8.6. SAN is run with the coefficients of Example 8.2 on the indicated two-dimensional quartic loss function. The table below shows the sample-mean terminal loss values for initial temperatures of .01, .1, and 1.0, based on an average over 4 replications. The results for the lowest and highest initial temperatures are poorer than the results in Table 8.2; the results for the intermediate initial temperature (.1) are pulled directly from Table 8.2.

N        T_init = .01    T_init = .1    T_init = 1.0
10       3.71            .91            8.75
100      .1              .67            6.11
1,000    .49             .4             .141

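For Exercise 7.3, the sketch below shows an SPSA gradient approximation with valid Bernoulli ±1 perturbations. The quadratic loss and the gain sequence are placeholder assumptions rather than the exercise's actual setup; the key point is that 1/Δ_i has bounded moments for Bernoulli perturbations, which fails for the uniform and normal perturbations that produce the divergence described above.

    import numpy as np

    rng = np.random.default_rng(0)

    def loss(theta):
        return float(np.sum(theta ** 2))   # placeholder loss for illustration

    def spsa_gradient(theta, c):
        # Valid perturbation: Bernoulli +/-1, so E|1/delta_i| is finite.
        # Uniform[-3, 3] or N(0, 1) perturbations would give 1/delta_i
        # unbounded inverse moments and destabilize the iterates.
        delta = rng.choice([-1.0, 1.0], size=theta.shape)
        y_plus = loss(theta + c * delta)
        y_minus = loss(theta - c * delta)
        return (y_plus - y_minus) / (2.0 * c * delta)

    theta = np.array([1.0, 1.0])
    for k in range(1000):
        a_k = 0.1 / (k + 1) ** 0.602       # standard decay exponent; gains assumed
        theta = theta - a_k * spsa_gradient(theta, c=0.1)
    print(theta, loss(theta))              # loss should be near zero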
9.7.* The probability of a given (selected) binary-coded chromosome being changed from one generation to the next is bounded above by P(crossover ∪ mutation), where the event "mutation" refers to at least one bit being mutated. (The probability of change is bounded above because crossover does not guarantee that the chromosome will be changed. For example, one-point crossover applied as indicated to the following two distinct chromosomes does not change either chromosome: [0 1 1 1 0] and [1 1 1 1 0].) Note that

P(crossover ∪ mutation) = P(crossover) + P(mutation) - P(crossover ∩ mutation)
                        = P_c + [1 - (1 - P_m)^B] - P_c [1 - (1 - P_m)^B].

The probability of passing intact is therefore bounded below by 1 - P(crossover ∪ mutation). (Note: The solution above represents a clarification of the solution on p. 554 of the first printing of the text to reflect that the probability P(crossover ∪ mutation) is an upper bound on, versus equal to, the probability of change in the chromosome.)

10.8.* Applying Stirling's approximation for large N and B to each of the factorials in the expression for P yields the approximation given in the text's solution.

11.3. (b) One aspect suggesting that the online form in (11.11) is superior is that it uses the most recent parameter estimate for all evaluations on the right-hand side of the parameter update recursion. The alternative form of TD in this exercise uses older parameter estimates for some of the evaluations on the right-hand side. Another potential shortcoming, noted by Sutton (1988), is that the differences in the predictions will be due to changes in both θ and the inputs (the x_τ), making it hard to isolate only the effects of interest. In contrast to the recursion of this exercise, the temporal difference in predictions on the right-hand side of the recursion in (11.11) is due solely to changes in the input values (since the parameter is the same in all predictions).

12.4. Under the null hypothesis, L(θ_m) = L(θ_j) for all j ≠ m. Further, E(δ_mj) = 0 for all j ≠ m by the assumption of a common mean for the noise terms. Then

cov(δ_mj, δ_mi) = E[(L̄_m - L̄_j)(L̄_m - L̄_i)]
               = E(L̄_m^2) - E(L̄_m L̄_i) - E(L̄_j L̄_m) + E(L̄_j L̄_i)
               = var(L̄_m),

where the last line follows by the mutual uncorrelatedness over all i, j, and m (so, e.g., E(L̄_j L̄_m) = E(L̄_j) E(L̄_m) = [E(L̄_m)]^2 for j ≠ m).

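The lower bound in Exercise 9.7 is straightforward to evaluate. In the sketch below, the crossover probability P_c, per-bit mutation probability P_m, and chromosome length B are illustrative values, not numbers from the exercise.

    # Lower bound on the probability that a selected chromosome passes intact
    # from one generation to the next (Exercise 9.7).
    def intact_lower_bound(p_c, p_m, B):
        p_mut = 1.0 - (1.0 - p_m) ** B        # P(at least one of B bits mutates)
        p_change = p_c + p_mut - p_c * p_mut  # P(crossover or mutation)
        return 1.0 - p_change

    print(intact_lower_bound(p_c=0.8, p_m=0.01, B=10))  # about 0.181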
13.7. Using MS EXCEL and the data in reeddata-fit, we follow the approach outlined in Example 13.5 in terms of splitting the data into four pairs of fitting/testing subsets. The four MSE values are .9, .146, .157, and .181, and the four MAD values are .61, .38, .33, and .38. The modified Table 13.3 is given below; the values in the first two columns are from Table 13.3 (so only the last column is new). The full linear model produces the lowest RMS and MAD values.

          Full linear       Reduced linear      Five-input
          model (13.9)      model (13.10)       linear model
RMS       .37               .386                .379
MAD       .266              .36                 .31

14.7. (b) We approximate the CRN-based and non-CRN-based covariance matrices using the sample covariance matrix from 10^6 gradient approximations generated via Monte Carlo simulation (this large number is needed to get a stable estimate of the matrix). The approximate covariance matrices are as follows:

CRN:      cov[ĝ(θ̂) | θ̂] ≈ [ 445.0    44.5
                              44.5   465.3 ],

Non-CRN:  cov[ĝ(θ̂) | θ̂] ≈ [ 5936.0    93.5
                               93.5  5956.4 ].

The above show that, for c = .1, the CRN covariance matrix is much smaller (in the matrix sense) than the non-CRN covariance matrix.

15.3. We have log p_V(υ | θ) = log p_V(υ | λ) = υ log λ + (1 - υ) log(1 - λ). Hence,

∂ log p_V(υ | λ)/∂λ = υλ^{-1} - (1 - υ)(1 - λ)^{-1}.

This simplifies to (λ - υ)/[λ(λ - 1)], as shown in the first component of ∂ log p_V(υ | θ)/∂θ in Example 15.4. The second component of ∂ log p_V(υ | θ)/∂θ is zero, since there is no dependence of p_V(υ | θ) on β. The final form for Y(θ̂) then follows immediately using expression (15.1).

16.4.* Let X be the current state value and W be the candidate point. The candidate point W is accepted with probability ρ(X, W). This probability is random, as it depends on X and W. For convenience, let π(x, w) = p(w)q(x)/[p(x)q(w)] (so ρ(x, w) = min{π(x, w), 1} according to (16.3)). The mean acceptance probability is given by

E[ρ(X, W)] = ∫∫_{π(x,w) ≥ 1} p(x)q(w) dx dw + ∫∫_{π(x,w) < 1} π(x, w) p(x)q(w) dx dw.

After several steps involving the re-expression of the integrals above (the reader should show these steps), it is found that

E[ρ(X, W)] = 2 ∫∫_{π(x,w) ≥ 1} p(x)q(w) dx dw.

The result to be proved then follows in several more steps (reader to show) by invoking the given inequality p(x) ≤ Cq(x). (Incidentally, the form of M-H where q(w | x) = q(w) is sometimes called the independent M-H sampler.)

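The CRN effect in Exercise 14.7(b) can be reproduced in miniature: a two-sided difference of a noisy loss has far smaller variance when the same random input appears on both sides. The one-dimensional loss below is an assumed stand-in for the exercise's simulation.

    import numpy as np

    rng = np.random.default_rng(1)

    def loss(theta, v):
        return (theta + v) ** 2     # toy stochastic loss

    theta, c, n = 1.0, 0.1, 100_000
    v1 = rng.standard_normal(n)
    v2 = rng.standard_normal(n)

    g_crn = (loss(theta + c, v1) - loss(theta - c, v1)) / (2 * c)  # common random numbers
    g_ind = (loss(theta + c, v1) - loss(theta - c, v2)) / (2 * c)  # independent runs

    print(np.var(g_crn), np.var(g_ind))  # CRN variance is dramatically smaller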
16.6. For the bivariate setting here, the general expression in (16.9) simplifies considerably. In particular, µ_i = µ_{\i} = 0, Σ_{i,\i} = ρ, and Σ_{\i,\i} = 1. Hence, the bivariate sampling for the Gibbs sampling procedure is: X_{k+1,1} ~ N(ρ X_{k,2}, 1 - ρ^2) and X_{k+1,2} ~ N(ρ X_{k+1,1}, 1 - ρ^2).

17.7. This problem is a direct application of the bound in property (iii), as shown in Subsection 17.2.2. The bound yields 1.098 ≤ det M(ξ_n) ≤ 1.5.

17.17. (a) As in Example 17.1, consider the information number at each measurement (instead of F_n). This value is x^2 exp(-2θx). Hence, the modified D-optimal criterion is

ρ x^2 exp(-2θ_1 x) + (1 - ρ) x^2 exp(-2θ_2 x).

The derivative with respect to x yields

ρ [2x exp(-2θ_1 x) - 2θ_1 x^2 exp(-2θ_1 x)] + (1 - ρ)[2x exp(-2θ_2 x) - 2θ_2 x^2 exp(-2θ_2 x)].

Simplifying and setting to zero yields the following transcendental equation for x:

ρ exp(-2θ_1 x)(1 - θ_1 x) + (1 - ρ) exp(-2θ_2 x)(1 - θ_2 x) = 0.

This is a starting point for finding the support point for the design; we also need to ensure that the solution is a maximum (not a minimum or saddlepoint).

APPENDICES A-E

A.3. (b) A plot of f(θ, x) depicts the continuity of f(θ, x) on Θ × Λ, but also illustrates that there are some sharp edges where the derivative will not be continuous. Hence, not all conditions of Theorem A.3 are satisfied.

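The bivariate Gibbs recursion in Exercise 16.6 translates directly into code; the value of ρ and the run length below are arbitrary.

    import numpy as np

    rng = np.random.default_rng(2)
    rho, n = 0.8, 50_000
    sd = np.sqrt(1.0 - rho ** 2)          # conditional standard deviation

    x = np.zeros(2)
    samples = np.empty((n, 2))
    for k in range(n):
        # X_{k+1,1} ~ N(rho * X_{k,2}, 1 - rho^2), then
        # X_{k+1,2} ~ N(rho * X_{k+1,1}, 1 - rho^2)
        x[0] = rho * x[1] + sd * rng.standard_normal()
        x[1] = rho * x[0] + sd * rng.standard_normal()
        samples[k] = x

    print(np.corrcoef(samples.T)[0, 1])   # sample correlation should be near rho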
B.5.* With the exception of the choice of matrix square root, we follow the steps of Exercise B.4. Using the chol function in MATLAB, the upper triangular Cholesky factorization of the initial covariance matrix is

[ 1.0   0.8
  0     0.6 ].

In transforming the data, note that (Σ^{1/2})^T Σ^{1/2} = Σ (but Σ^{1/2} (Σ^{1/2})^T ≠ Σ!). We find that the t-statistics are .74 and .734 for the matched and unmatched pairs, respectively. This leads to corresponding P-values (for a one-sided test) of .238 (9 degrees of freedom) and .236 (18 degrees of freedom), respectively. These values are close to one another. Hence, the merit of using the matched-pairs test has been almost totally lost as a result of the relatively low correlation here.

C.4. (b) Note that E(|X_k - X|^r) = E(|X_k - X|^r · 1). Because 0 < r < q, the Hölder inequality can be applied to the right-hand side of this expression:

E(|X_k - X|^r · 1) ≤ {E[(|X_k - X|^r)^{q/r}]}^{r/q} · {E[1^{q/(q-r)}]}^{(q-r)/q} = [E(|X_k - X|^q)]^{r/q}.

But E(|X_k - X|^q) → 0 as k → ∞ by assumption, implying E(|X_k - X|^r) → 0 from the bound above. This completes the proof.

D.1. Given J_0 = a = c = 3 and modulus m = 5, we have J_1 = (a J_0 + c) (mod m) = (3 × 3 + 3) (mod 5) = 2. Continuing the iterative process, it is found that J_4 = J_0. Therefore, this generator has period 4, which is smaller than the modulus.

E.1. From the matrix P^4 in Example E.2, we hypothesize that p̄ = [.35, .56, .439]^T. As a first check on the validity of this solution, note that the elements sum to unity. Based on the given P, it is then straightforward to check the defining relationship p̄^T = p̄^T P given in (E.2). This indicates that, to within three digits, the hypothesized p̄ is, in fact, the actual p̄.
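For Exercise B.5, the sketch below generates correlated pairs through an upper-triangular Cholesky factor (so that Σ = R^T R, per the caution in the solution) and runs both tests. The covariance matrix and sample size are illustrative stand-ins for the exercise's data.

    import numpy as np
    from scipy.stats import ttest_rel, ttest_ind

    rng = np.random.default_rng(3)
    cov = np.array([[1.0, 0.8],
                    [0.8, 1.0]])             # assumed covariance of one pair
    R = np.linalg.cholesky(cov).T            # upper-triangular factor: cov = R^T R
    data = rng.standard_normal((10, 2)) @ R  # ten correlated pairs

    t_matched = ttest_rel(data[:, 0], data[:, 1])    # matched pairs: n - 1 dof
    t_unmatched = ttest_ind(data[:, 0], data[:, 1])  # two samples: 2n - 2 dof
    print(t_matched.statistic, t_unmatched.statistic)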
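The generator in Exercise D.1 is small enough to enumerate its entire cycle:

    # Linear congruential generator of Exercise D.1: J_{k+1} = (a*J_k + c) mod m
    a, c, m, j0 = 3, 3, 5, 3
    cycle, j = [j0], (a * j0 + c) % m       # J_1 = (3*3 + 3) mod 5 = 2
    while j != j0:
        cycle.append(j)
        j = (a * j + c) % m
    print(cycle, "period =", len(cycle))    # [3, 2, 4, 0], period = 4 (< m = 5)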
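Finally, the check in Exercise E.1, that a hypothesized p̄ satisfies p̄^T = p̄^T P, is one line of linear algebra. The transition matrix below is a hypothetical stand-in, since Example E.2's P is not reproduced here.

    import numpy as np

    # Hypothetical 3-state transition matrix (rows sum to 1); NOT the P of Example E.2.
    P = np.array([[0.5, 0.3, 0.2],
                  [0.2, 0.6, 0.2],
                  [0.1, 0.3, 0.6]])

    p_bar = np.linalg.matrix_power(P, 50)[0]   # hypothesize p-bar from a high power of P
    print(p_bar, np.allclose(p_bar, p_bar @ P), p_bar.sum())  # stationary; sums to 1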