Learning with Decision-Dependent Uncertainty

Size: px

Start display at page:

Download "Learning with Decision-Dependent Uncertainty"

Madison Heath
5 years ago
Views:

1 Learning with Decision-Dependent Uncertainty Tito Homem-de-Mello Department of Mechanical and Industrial Engineering University of Illinois at Chicago NSF Workshop, University of Maryland, May 2010 Tito Homem-de-Mello (UIC) Learning with Decision-Dependent Uncertainty NSF Workshop May / 7

2 Learning and deciding Many decision-making problems fit the following framework: Collect data Estimate parameters Make decision Make new decision Update parameter estimates Collect more data Tito Homem-de-Mello (UIC) Learning with Decision-Dependent Uncertainty NSF Workshop May / 7

3 Convergence Does this process converge? This is clearly the case when the observed data are i.i.d. observations from a given distribution F. In that case, ˆF n is the empirical distribution corresponding to samples from F, so under mild conditions ˆx n x. But what if the demand depends on the decisions? Many cases (e.g. some revenue management problems) fall into that framework. In that case, the true distribution is F = F (x, ). If the dependence is ignored, the forecast/decision process converges to x satisfying x = argmin E F (x, )[G(x, Y )], which can be suboptimal. (Cooper+HM+Kleywegt 2006, Lee+HM+Kleywegt 2009) Tito Homem-de-Mello (UIC) Learning with Decision-Dependent Uncertainty NSF Workshop May / 7

4 Convergence Does this process converge? This is clearly the case when the observed data are i.i.d. observations from a given distribution F. In that case, ˆF n is the empirical distribution corresponding to samples from F, so under mild conditions ˆx n x. But what if the demand depends on the decisions? Many cases (e.g. some revenue management problems) fall into that framework. In that case, the true distribution is F = F (x, ). If the dependence is ignored, the forecast/decision process converges to x satisfying x = argmin E F (x, )[G(x, Y )], which can be suboptimal. (Cooper+HM+Kleywegt 2006, Lee+HM+Kleywegt 2009) Tito Homem-de-Mello (UIC) Learning with Decision-Dependent Uncertainty NSF Workshop May / 7

5 Convergence Does this process converge? This is clearly the case when the observed data are i.i.d. observations from a given distribution F. In that case, ˆF n is the empirical distribution corresponding to samples from F, so under mild conditions ˆx n x. But what if the demand depends on the decisions? Many cases (e.g. some revenue management problems) fall into that framework. In that case, the true distribution is F = F (x, ). If the dependence is ignored, the forecast/decision process converges to x satisfying x = argmin E F (x, )[G(x, Y )], which can be suboptimal. (Cooper+HM+Kleywegt 2006, Lee+HM+Kleywegt 2009) Tito Homem-de-Mello (UIC) Learning with Decision-Dependent Uncertainty NSF Workshop May / 7

6 Convergence Does this process converge? This is clearly the case when the observed data are i.i.d. observations from a given distribution F. In that case, ˆF n is the empirical distribution corresponding to samples from F, so under mild conditions ˆx n x. But what if the demand depends on the decisions? Many cases (e.g. some revenue management problems) fall into that framework. In that case, the true distribution is F = F (x, ). If the dependence is ignored, the forecast/decision process converges to x satisfying x = argmin E F (x, )[G(x, Y )], which can be suboptimal. (Cooper+HM+Kleywegt 2006, Lee+HM+Kleywegt 2009) Tito Homem-de-Mello (UIC) Learning with Decision-Dependent Uncertainty NSF Workshop May / 7

7 Learning the dependence Can we learn the structure of the dependence of F on x, and simultaneously optimize the objective function? For example, consider the simple case where Y n = β 0 + β 1 x + ε n, where β 0 and β 1 are unknown constants and {ε n } is i.i.d. G(x, Y ) = xy. At iteration n, Estimate β 0 and β 1 from ˆx 1,..., ˆx n using least-squares; Compute ˆx n+1 = ˆβ 0 n 2 ˆβ. 1 n This kind of procedure is called certainty equivalent control in the control and econometrics literature. Very hard to show convergence! Tito Homem-de-Mello (UIC) Learning with Decision-Dependent Uncertainty NSF Workshop May / 7

8 Learning the dependence Can we learn the structure of the dependence of F on x, and simultaneously optimize the objective function? For example, consider the simple case where Y n = β 0 + β 1 x + ε n, where β 0 and β 1 are unknown constants and {ε n } is i.i.d. G(x, Y ) = xy. At iteration n, Estimate β 0 and β 1 from ˆx 1,..., ˆx n using least-squares; Compute ˆx n+1 = ˆβ 0 n 2 ˆβ. 1 n This kind of procedure is called certainty equivalent control in the control and econometrics literature. Very hard to show convergence! Tito Homem-de-Mello (UIC) Learning with Decision-Dependent Uncertainty NSF Workshop May / 7

9 Learning the dependence Can we learn the structure of the dependence of F on x, and simultaneously optimize the objective function? For example, consider the simple case where Y n = β 0 + β 1 x + ε n, where β 0 and β 1 are unknown constants and {ε n } is i.i.d. G(x, Y ) = xy. At iteration n, Estimate β 0 and β 1 from ˆx 1,..., ˆx n using least-squares; Compute ˆx n+1 = ˆβ 0 n 2 ˆβ. 1 n This kind of procedure is called certainty equivalent control in the control and econometrics literature. Very hard to show convergence! Tito Homem-de-Mello (UIC) Learning with Decision-Dependent Uncertainty NSF Workshop May / 7

10 Learning the dependence Can we learn the structure of the dependence of F on x, and simultaneously optimize the objective function? For example, consider the simple case where Y n = β 0 + β 1 x + ε n, where β 0 and β 1 are unknown constants and {ε n } is i.i.d. G(x, Y ) = xy. At iteration n, Estimate β 0 and β 1 from ˆx 1,..., ˆx n using least-squares; Compute ˆx n+1 = ˆβ 0 n 2 ˆβ. 1 n This kind of procedure is called certainty equivalent control in the control and econometrics literature. Very hard to show convergence! Tito Homem-de-Mello (UIC) Learning with Decision-Dependent Uncertainty NSF Workshop May / 7

11 Can we use stochastic approximation? We could apply an SA-type procedure to the function g(x) = E[G(x, Y )]. We cannot get derivatives estimators if we don t know how Y depends on x. Even if we know the dependence form, it may be hard to prove convergence. Again, suppose Y = β 0 + β 1 x + ε, G(x, Y ) = xy. Then, g (x) = β 0 + 2β 1 x. We can estimate β 0 and β 1 from regression, but does it converge? We could use finite differences, but convergence is slow. Tito Homem-de-Mello (UIC) Learning with Decision-Dependent Uncertainty NSF Workshop May / 7

12 Can we use stochastic approximation? We could apply an SA-type procedure to the function g(x) = E[G(x, Y )]. We cannot get derivatives estimators if we don t know how Y depends on x. Even if we know the dependence form, it may be hard to prove convergence. Again, suppose Y = β 0 + β 1 x + ε, G(x, Y ) = xy. Then, g (x) = β 0 + 2β 1 x. We can estimate β 0 and β 1 from regression, but does it converge? We could use finite differences, but convergence is slow. Tito Homem-de-Mello (UIC) Learning with Decision-Dependent Uncertainty NSF Workshop May / 7

13 Can we use stochastic approximation? We could apply an SA-type procedure to the function g(x) = E[G(x, Y )]. We cannot get derivatives estimators if we don t know how Y depends on x. Even if we know the dependence form, it may be hard to prove convergence. Again, suppose Y = β 0 + β 1 x + ε, G(x, Y ) = xy. Then, g (x) = β 0 + 2β 1 x. We can estimate β 0 and β 1 from regression, but does it converge? We could use finite differences, but convergence is slow. Tito Homem-de-Mello (UIC) Learning with Decision-Dependent Uncertainty NSF Workshop May / 7

14 Can we use stochastic approximation? We could apply an SA-type procedure to the function g(x) = E[G(x, Y )]. We cannot get derivatives estimators if we don t know how Y depends on x. Even if we know the dependence form, it may be hard to prove convergence. Again, suppose Y = β 0 + β 1 x + ε, G(x, Y ) = xy. Then, g (x) = β 0 + 2β 1 x. We can estimate β 0 and β 1 from regression, but does it converge? We could use finite differences, but convergence is slow. Tito Homem-de-Mello (UIC) Learning with Decision-Dependent Uncertainty NSF Workshop May / 7

15 Comparing the methods β 0 = 5, β 1 = 1, ε =Normal(0,1) x n CEC SA FD SA reg Opt n x 10 4 Tito Homem-de-Mello (UIC) Learning with Decision-Dependent Uncertainty NSF Workshop May / 7

16 Issues Is it better to first learn the dependence structure, and then optimize? (e.g., Besbes and Zeevi 2010 in the context of demand functions) Alternatively, we can view the problem as a black-box system where we only observe outputs (in this case, revenues) in terms of the inputs (the decisions). We want to optimize however, in the context we are interested in, function evaluations are expensive. We are looking at methods that choose decision points randomly according to some distribution, which is updated from iteration to iteration. Tito Homem-de-Mello (UIC) Learning with Decision-Dependent Uncertainty NSF Workshop May / 7

17 Issues Is it better to first learn the dependence structure, and then optimize? (e.g., Besbes and Zeevi 2010 in the context of demand functions) Alternatively, we can view the problem as a black-box system where we only observe outputs (in this case, revenues) in terms of the inputs (the decisions). We want to optimize however, in the context we are interested in, function evaluations are expensive. We are looking at methods that choose decision points randomly according to some distribution, which is updated from iteration to iteration. Tito Homem-de-Mello (UIC) Learning with Decision-Dependent Uncertainty NSF Workshop May / 7

18 Issues Is it better to first learn the dependence structure, and then optimize? (e.g., Besbes and Zeevi 2010 in the context of demand functions) Alternatively, we can view the problem as a black-box system where we only observe outputs (in this case, revenues) in terms of the inputs (the decisions). We want to optimize however, in the context we are interested in, function evaluations are expensive. We are looking at methods that choose decision points randomly according to some distribution, which is updated from iteration to iteration. Tito Homem-de-Mello (UIC) Learning with Decision-Dependent Uncertainty NSF Workshop May / 7

19 Issues Is it better to first learn the dependence structure, and then optimize? (e.g., Besbes and Zeevi 2010 in the context of demand functions) Alternatively, we can view the problem as a black-box system where we only observe outputs (in this case, revenues) in terms of the inputs (the decisions). We want to optimize however, in the context we are interested in, function evaluations are expensive. We are looking at methods that choose decision points randomly according to some distribution, which is updated from iteration to iteration. Tito Homem-de-Mello (UIC) Learning with Decision-Dependent Uncertainty NSF Workshop May / 7

The Perceptron algorithm

The Perceptron algorithm Tirgul 3 November 2016 Agnostic PAC Learnability A hypothesis class H is agnostic PAC learnable if there exists a function m H : 0,1 2 N and a learning algorithm with the following