Florian Hoffmann. September - December Vancouver School of Economics University of British Columbia

Lecture Notes on Graduate Labor Economics
Section 3: Human Capital Theory and the Economics of Education
Copyright: Florian Hoffmann. Please do not circulate.
Florian Hoffmann, Vancouver School of Economics, University of British Columbia
September - December 2017

1 Introduction

As we have seen, there is substantial wage and income inequality between individuals at a point in time and within individuals over time. Understanding the factors explaining these inequalities is one of the central questions of the social, and to some extent the natural, sciences. What primarily distinguishes an economist's approach to this question is its focus on individual rather than social factors. Rather than imposing the hypothesis that such inequalities are driven by factors such as discrimination, social environment, social networks, cronyism etc., labor economics first explores how much of wage inequality can be explained by differences in individual-level factors and decisions. The central statistical model in this literature is the Mincer earnings equation, which is a regression of individual wages or income on measures of human capital, such as years of education and years of labor market experience. This puts human capital front and center.

But what exactly is "human capital"? Acemoglu and Autor write in their lecture notes posted on the course website that "loosely speaking, human capital corresponds to any stock of knowledge or characteristics the worker has (either innate or acquired) that contributes to his or her productivity". In light of recent research on human capital formation in the context of life-cycle models, this definition is quite imprecise. In dynamic economic models, capital is a stock variable that evolves according to some law of motion, and the crucial flow variable is investment. Investments in turn are uses of resources that come with costs in the current period and expected gains in the future. Thus, the stock depends on past investments and should not be viewed as "innate". By definition, capital is something acquired. Furthermore, knowledge and human capital are not necessarily the same thing. To see this, consider the following example.
We assume that wages depend positively on some skill, say $\alpha$, which varies in the population. The skill may depend positively on some stock of human capital, such as schooling. Denote this stock variable by $S$ and assume that $\alpha = f(S) + \varepsilon$ for some positively sloped concave function $f(S)$ and some random variable $\varepsilon$ that stands in for skill heterogeneity beyond $f(S)$. The stock $S$ can be expanded by investing $I$ units of time per period in some resource; for example, $I$ may be an indicator for whether an individual goes to school in a period. A simple accumulation process for human capital is

$$S' = g(S) + \theta I, \tag{1}$$

where $S'$ denotes next period's stock and $g(S)$ is a function that may capture depreciation. The variable $\theta$ in turn is assumed to be heterogeneous in the population as well. We may assume that the two variables that are random in the population conditional on $S$, namely $\varepsilon$ and $\theta$, have joint distribution $H(\varepsilon, \theta)$. Heckman, Lochner and Taber (1998) call these two random variables the ability to earn and the ability to learn, respectively. Indeed, conditional on schooling the skill $\varepsilon$ determines the earnings level, while the rate at which individuals add to their stock of human capital is governed by the rate of learning $\theta$. Assuming that wages reflect productivity, we have

$$w = f(S) + \varepsilon, \tag{2}$$

which brings us back to the central concepts in modern human capital theory.

First, knowledge and the stock of human capital are not necessarily the same. The stock of human capital may increase the stock of knowledge relevant for a job, but it is not the knowledge itself. Schooling may increase mathematical knowledge, which makes one better suited for quantitative or analytical jobs, but it is not identical to this knowledge. How human capital maps into skills is in fact an open research question. Traditionally, we have equated schooling with skills, thereby setting $f(S) = S$, but this is just because measuring the true relationship $f(S)$ between the human capital stock and skills is difficult.

Second, innate skills are not human capital. They are the random variables $(\varepsilon, \theta)$ and may determine how human capital investments map into human capital stocks via heterogeneity in the ability to learn. They also capture productivity dispersion over and above dispersion in the stock of human capital via heterogeneity in the ability to earn. Modern human capital theory is very careful about the distinction between skills and human capital. Keep this in mind if you decide to do research in this area.

There are some further issues that make human capital theory interesting and quite different from models of physical investments. Let me point out two of them. First, human capital is inherently tied to humans. It cannot be freely traded in markets, which differentiates it from physical investments. As a consequence, it cannot be collateralized. If we asked an investor to finance our human capital investments, we could just run away with the money. This has important implications for the shape of the borrowing constraints and for public policy.
Second, human capital investments may be specific to some job or task. That is, they may be valuable for the job we currently have or expect to have in the future, but completely useless for another job. We could get trained in carpentry, which would make us a better carpenter but not a better economist. How transferable human capital is remains an open question. It is not even clear what it is tied to - general job requirements, occupations, firms, tasks etc.

2 The Returns to Education

2.1 Continuous Choices

We start by discussing the most intuitive form of human capital investment: schooling. This already brings out some of the most important aspects of human capital theory. Here we will closely follow Card's (2001) Econometrica paper, which summarizes the main advances and ideas in the vast literature on the economics of education. This paper discusses the main issues in the estimation of returns to education and covers various empirical approaches, with an emphasis on IV. It is a great and important paper you should read line by line.

It is convenient to view schooling as a continuous choice variable and to assume that working and schooling are mutually exclusive alternatives. Since schooling requires time, this implies that we should use a continuous time environment. Individuals need to choose the time spent in school $S$ that maximizes life-cycle utility subject to a life-cycle budget constraint. To simplify matters we assume that there are no borrowing constraints and no uncertainty. Furthermore, when in school, individuals receive earnings that exactly cover the monetary costs of education. Once they exit school, earnings $y(S)$ depend only on schooling. Preferences are described as follows:

$$U(C,t) = \begin{cases} \log C(t) - \phi(t) & \text{if } t \le S \\ \log C(t) & \text{if } t > S \end{cases} \tag{3}$$

If the horizon is assumed to be infinite and the discount rate is $\rho$, individuals solve

$$\max_{C(t),\,S} \int_0^S \left[\log C(t) - \phi(t)\right] e^{-\rho t}\,dt + \int_S^\infty \log C(t)\, e^{-\rho t}\,dt \tag{4}$$

$$\text{s.t. } \int_0^\infty C(t)\, e^{-Rt}\,dt = \int_S^\infty y(S)\, e^{-Rt}\,dt, \tag{5}$$

where $R$ is the interest rate on loans. Notice the lack of borrowing constraints here. An individual receives positive net earnings only from period $S$ on but consumes in all periods. Furthermore, because we assume that earnings $y(S)$ do not directly depend on age, we have

$$\int_S^\infty y(S)\, e^{-Rt}\,dt = \frac{y(S)\, e^{-RS}}{R}.$$

In the next homework assignment you are required to show that the optimal schooling choice satisfies

$$\frac{y'(S)}{y(S)} = R + \rho\, e^{-\rho S} \phi(S). \tag{6}$$

You will also be asked to show that without borrowing constraints we have full "separation" between the schooling and the consumption decisions. In particular, the solution of the problem above is equivalent to first choosing the $S$ that maximizes life-cycle earnings and then selecting the optimal path of consumption.
To map this into an econometric model with heterogeneity across individuals indexed by $i$, Card assumes that

$$\log\left[y_i(S)\right] = \alpha_i + b_i S - 0.5\, k_1 S^2 \tag{7}$$

so that $y_i'(S)/y_i(S) = b_i - k_1 S$. In the language of Heckman et al., this model features an ability to earn $\alpha_i$ and an ability to learn, the latter of which is captured by the reduced-form parameter $b_i$.

Let's briefly digress into the definition of a "return". An interest rate is the percentage change in wealth per unit of capital. It is thus a semi-elasticity. The same is true for the returns to education. They are an interest rate paid per unit of schooling, that is, a semi-elasticity. This is the expression $y_i'(S)/y_i(S)$.

Using the approximation $R_i + \rho\, e^{-\rho S} \phi_i(S) \approx r_i + k_2 S$ for the marginal cost in (6) and equating it with the marginal return $b_i - k_1 S$ implied by (7), we get a closed-form solution for the optimal schooling level of individual $i$:

$$S_i = \frac{b_i - r_i}{k_1 + k_2}. \tag{8}$$

It is important to note that $(\alpha_i, b_i, r_i)$ are random in the population and should be assumed to be correlated. If $k_2 = 0$ and $r_i = r$, then the only determinant of schooling is $b_i$, the heterogeneity in returns. This case is often referred to as "equality of opportunity" because conditional on skills, everybody makes the same education choices. The model is silent about the sources of heterogeneity in skills and assumes it is "innate". If the cost parameter $r_i$ is heterogeneous in the population, then the population distribution of $S_i$ conditional on $b_i$ will be non-degenerate unless $corr(b_i, r_i) = 1$.

Now suppose we want to estimate the returns to education, $y_i'(S)/y_i(S) \equiv \beta_i$. For most individuals, leaving school is a terminal state, so that there is no longitudinal variation in $S_i$ that can be used to identify these returns. Hence, identification will rely on cross-sectional variation. Drawing a random sample from the population, the average return is

$$E[\beta_i] = E[b_i - k_1 S_i] = E[b_i] - k_1 E[S_i].$$

This is the expected increase in average log earnings if a random sample of the population acquires an additional unit of education and corresponds to the "Average Treatment Effect" (ATE) of schooling. This is not necessarily the relevant marginal return for evaluating any particular schooling intervention: every such intervention will potentially target a different group that may be a non-random draw from the population distribution of $\beta_i$. But what are we identifying when running an OLS regression of log earnings on a second-order polynomial in schooling?
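Before turning to that question, it may help to see the objects defined so far in a small simulation. The sketch below is only an illustration under assumed values: the normal distributions, the numbers chosen for $k_1$, $k_2$ and the means of $b_i$ and $r_i$, and the helper name `simulate_card_model` are all hypothetical and not taken from Card's paper. It draws negatively correlated $(b_i, r_i)$, applies the schooling rule (8), and compares the ATE with the average return among individuals with above-median and below-median schooling.

```python
import random
import statistics

K1, K2 = 0.005, 0.010  # assumed values for the curvature parameters k_1 and k_2


def simulate_card_model(n=20_000, seed=0):
    """Draw (b_i, r_i) and apply the closed-form schooling rule (8)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        common = rng.gauss(0.0, 1.0)  # shared factor: makes cov(b_i, r_i) negative
        b = 0.18 + 0.02 * common + 0.01 * rng.gauss(0.0, 1.0)
        r = 0.03 - 0.01 * common + 0.01 * rng.gauss(0.0, 1.0)
        s = (b - r) / (K1 + K2)       # optimal schooling, equation (8)
        beta = b - K1 * s             # marginal return at the chosen schooling level
        rows.append((b, s, beta))
    return rows


rows = simulate_card_model()
mean_b = statistics.fmean(b for b, _, _ in rows)
mean_s = statistics.fmean(s for _, s, _ in rows)
ate = statistics.fmean(beta for _, _, beta in rows)  # E[beta_i]

# Average returns in two non-random subgroups defined by chosen schooling.
median_s = statistics.median(s for _, s, _ in rows)
beta_high = statistics.fmean(beta for _, s, beta in rows if s > median_s)
beta_low = statistics.fmean(beta for _, s, beta in rows if s <= median_s)

print("ATE:", ate)
print("E[b] - k1 * E[S]:", mean_b - K1 * mean_s)
print("mean return, above-median schooling:", beta_high)
print("mean return, below-median schooling:", beta_low)
```

With these assumed parameters the two subgroup averages bracket the ATE, which is the sense in which an intervention targeting a selected group need not recover the population average return. Other parameter choices can reverse the ordering, since $cov(\beta_i, S_i)$ depends on the relative size of $k_1$ and the heterogeneity in $b_i$ and $r_i$.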
The "true" regression equation can be written as follows:

$$\log\left[y_i(S_i)\right] = \alpha_i + b_i S_i - 0.5\, k_1 S_i^2 = \left[\alpha_0 + \bar{b} S_i - 0.5\, k_1 S_i^2\right] + \left[a_i + (b_i - \bar{b}) S_i\right] \tag{9}$$

where $a_i = \alpha_i - \alpha_0$, with $\alpha_0$ and $\bar{b}$ denoting the population means of $\alpha_i$ and $b_i$, and where

$$a_i = \lambda_0 (S_i - \bar{S}) + u_i \tag{10}$$

$$(b_i - \bar{b}) = \psi_0 (S_i - \bar{S}) + v_i \tag{11}$$

are linear projections of the unobserved skill-heterogeneity parameters $(a_i, (b_i - \bar{b}))$ on observed schooling. This is a linear approximation to the policy function (8) and captures the endogeneity of schooling choices. The first bracket in the rewritten regression equation does not contain any unobserved heterogeneity, while the second bracket is unobserved and will introduce biases and inconsistencies into the OLS estimator of the ATE in the sense that $plim(\hat{b}^{OLS}) \neq \bar{b}$. There are two sources of bias here. For signing these biases, assume that $cov(a_i, b_i) \ge 0$, $cov(a_i, r_i) \le 0$, $cov(b_i, r_i) \le 0$.

The first bias is the conventional "ability bias" arising from $\lambda_0 \neq 0$. With our assumptions about the covariances between the different heterogeneity components, it is clear that $\lambda_0 > 0$. In this case individuals who always earn more than $E[\log y_i \mid S]$ - i.e. those who are at the top of the earnings distribution conditional on any level of schooling - also tend to choose more education. As a consequence, the ATE will be estimated with an upward bias.

The second bias is the "comparative advantage bias" that arises if $\psi_0 \neq 0$. Again, with our assumptions this bias is positive ($\psi_0 > 0$). Intuitively, those observed to have high levels of education are also those who are most likely to have the highest returns to education. This is a pure incentive effect driven by heterogeneity in the ability to learn.

Since OLS will be biased, we may consider estimating the ATE using an IV approach. If we find instrumental variables $Z$ that are independent of $a_i$ and $b_i$ and that directly affect schooling $S$, then we can estimate $\bar{b}$ consistently. The three families of instruments that are popular are:

1. College proximity, since it affects $r_i$ and since it may be independent of $(a_i, b_i)$ once one conditions on a rich set of parental background variables.
2. Quarter of birth, since it affects the age of school entry and thus the school leaving age.
3. Compulsory schooling laws, since they force individuals to take a minimum amount of schooling and since they should be exogenous to the individual.

There is a sizeable literature that attacks each of these IVs. First, college proximity may interact with the incentive effects generated by $(a_i, b_i)$ in non-trivial ways that may not be captured in the linear-regression approach. In particular, Card shows that the effect of IQ is stronger among men who live near a college than among those who do not. One reason may be that the location of colleges is endogenous in the sense that they are opened where demand is largest. One may think of a case where $cov(r_i, b_i) < 0$ in regions where there are colleges while the opposite is the case in regions where there are none. Second, quarter of birth may be endogenous if parents who tend to invest more in the cognitive abilities of their children also time their birth. There is indeed evidence that supports this hypothesis.
Furthermore, conditional on education, younger individuals (those born in a later quarter) underperform relative to their older classmates, suggesting that the instrument will be correlated with $(a_i, b_i)$. Third, compulsory schooling laws are state-level or country-level laws that may have major equilibrium effects on the returns to education. Also, the problems with LATE discussed below may be particularly severe for this instrument.

The discussion of the model above suggests that it is reasonable to assume that the OLS estimator is biased upward. Hence, a priori we would expect IV estimates to be significantly smaller than OLS estimates. However, in practice the opposite is true. Seminal research by Imbens and Angrist (1994) analyzed why this may be the case. Consider a binary instrument $Z \in \{0, 1\}$ that affects one of two otherwise identical populations in such a way that $S_i$ increases by $\Delta S_i$ among those who are treated ($Z_i = 1$). The marginal monetary return of this intervention for individual $i$ is approximately $\Delta \log y_i \approx \beta_i \Delta S_i$.

For a binary instrument we know that

$$plim(\hat{b}^{IV}) = \frac{cov(\log y_i, Z_i)}{cov(S_i, Z_i)} = \frac{E[\log y_i \mid Z_i = 1] - E[\log y_i \mid Z_i = 0]}{E[S_i \mid Z_i = 1] - E[S_i \mid Z_i = 0]}. \tag{12}$$

Under the assumptions for a valid IV, we have that $E[\log y_i \mid Z_i = 1] = E[\log y_i \mid Z_i = 0] + E[\beta_i \Delta S_i]$ and $E[S_i \mid Z_i = 1] = E[S_i \mid Z_i = 0] + E[\Delta S_i]$. Hence, we get

$$plim(\hat{b}^{IV}) = \frac{E[\beta_i \Delta S_i]}{E[\Delta S_i]}, \tag{13}$$

which is equal to the ATE only if $E[\beta_i \Delta S_i] = E[\beta_i]\, E[\Delta S_i]$. This is the case in which the treatment induces exactly the same choices for all individuals. However, if the response is heterogeneous, then $\beta_i$ and $\Delta S_i$ will not be independent. In this case, the IV estimator will put disproportionate weight on those who are most affected by the instrument. In this sense, it identifies a local effect. For example, when using compulsory schooling laws, those who are affected by the policy intervention are those who would have dropped out of school in the counterfactual regime. The returns to education in this subpopulation are likely to be very different from those in the whole population. However, because these are the only individuals who are affected by the instrument, $\hat{b}^{IV}$ identifies the returns to education for this subpopulation. Hence, it is a "local" effect and thus referred to as the Local Average Treatment Effect (LATE).

As Card points out, this motivates relying on structural models, where one spells out the heterogeneity and then derives structural policy functions that explicitly depend on heterogeneity. In this case one can then use the policy functions to identify the returns to education for any subpopulation. After all, the structural decision rules will tell us exactly what the optimal choice and the returns are conditional on observed and unobserved characteristics.
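The weighted-average interpretation of the IV estimand can be checked numerically. The following sketch is an illustration under assumptions that are not in the notes: log earnings are taken to be exactly linear in schooling for each individual (so that $\Delta \log y_i = \beta_i \Delta S_i$ holds exactly), the instrument mimics a compulsory schooling law with a hypothetical minimum of 11 years, and all parameter values and the name `simulate_binary_instrument` are made up. It compares the Wald ratio in (12) with the sample analogue of $E[\beta_i \Delta S_i]/E[\Delta S_i]$ and with the ATE.

```python
import random


def simulate_binary_instrument(n=100_000, seed=1):
    """Wald estimator with heterogeneous returns and heterogeneous responses."""
    rng = random.Random(seed)
    sum_y = [0.0, 0.0]   # sums of log earnings by Z
    sum_s = [0.0, 0.0]   # sums of schooling by Z
    count = [0, 0]
    sum_beta = sum_ds = sum_beta_ds = 0.0
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)            # unobserved type
        beta = 0.08 - 0.02 * u             # individual return (assumed form)
        s0 = 12.0 + 2.0 * u                # schooling chosen without the law
        ds = max(0.0, 11.0 - s0)           # response: only would-be dropouts react
        z = 1 if rng.random() < 0.5 else 0 # instrument, independent of u
        s = s0 + z * ds
        log_y = 0.3 * u + beta * s + rng.gauss(0.0, 0.1)
        sum_y[z] += log_y
        sum_s[z] += s
        count[z] += 1
        sum_beta += beta
        sum_ds += ds
        sum_beta_ds += beta * ds
    wald = ((sum_y[1] / count[1] - sum_y[0] / count[0])
            / (sum_s[1] / count[1] - sum_s[0] / count[0]))
    return {
        "wald": wald,                        # sample analogue of (12)
        "weighted": sum_beta_ds / sum_ds,    # E[beta * dS] / E[dS]
        "ate": sum_beta / n,                 # E[beta]
    }


result = simulate_binary_instrument()
print(result)
```

In this design the individuals who respond to the instrument are those with low baseline schooling, and by assumption they have above-average returns, so the Wald ratio tracks the response-weighted average return rather than the ATE.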
Since education is arguably a forward-looking decision - there would be no tuition fees in equilibrium if individuals were not expecting any monetary returns from education (unless they derive positive net utility from going to school) - we will need to formulate a dynamic model. Furthermore, education is a discrete choice, and earnings are observed in the data. This leads us to a dynamic version of the Roy model with observable payoffs.

2.2 Estimating Heterogeneous Returns to Education using a DDC-Model (Belzil & Hansen, Econometrica 2002)

We discuss a much simplified version of Belzil and Hansen (2002), which is an econometric structural model of the education choice and payoffs. To abstract from a non-trivial savings decision we assume that agents are risk-neutral and derive (dis-)utility $\phi(\cdot)$ from going to school. Preferences are thus

$$U(C,t) = \begin{cases} C(t) - \phi(t) & \text{if } t \le S \\ C(t) & \text{if } t > S \end{cases} \tag{14}$$

Given risk-neutrality, we may assume that individuals consume their current earnings:

$$C(t) = W(t, S), \tag{15}$$

where $W(\cdot,\cdot)$ are earnings conditional on schooling and experience. Time is discrete and finite, with $t \in \{0, 1, 2, \dots, T\}$ denoting age, and individuals decide whether to stay in or leave school at the beginning of the period. In contrast to Belzil and Hansen we also assume that there is no unemployment or non-participation, and that individuals do not return to school once they have left it. This is a strong assumption used to simplify our analysis. It also implies that we will not be able to identify the disutility of work since we never observe an individual who is leaving the labor force. We therefore assume that workers do not derive any (dis-)utility from working. In the actual estimation, these assumptions would need to be relaxed to turn the model into a realistic econometric framework.

It remains to specify the stochastic processes for $\phi(t)$ and $W(t, S)$. We assume that

$$\phi_i(t) = \nu_i \quad \text{for all } t \le S \tag{16}$$

$$W_i(t, S) = \begin{cases} \exp\{X_i \delta + \zeta_{it}\} \equiv W_i^s(t) & \text{if } t \le S \\ \exp\{f(S_i, \theta) + g(ex, \gamma) + \alpha_i + \varepsilon_{it}\} \equiv W_i^e(t, S_i) & \text{if } t > S \end{cases} \tag{17}$$

where $f(S_i, \theta)$ is a flexible parametric function of schooling and $g(ex, \gamma)$ is a polynomial in labor market experience, with $ex = t - S$. Note that the returns to schooling are not individual specific, but depend on schooling. That is, conditional on $S$ there is no heterogeneity in the ability to learn, only heterogeneity in the ability to earn. In particular, $f(S_i, \theta)$ is allowed to be any parametric function whose parameters will be estimated. Note however that in the sample and in the population the non-linearity of $f(S_i, \theta)$ in schooling may generate differences in the returns to schooling across individuals, since different individuals will choose different levels of education. The terms $(\nu_i, \alpha_i, \varepsilon_{it}, \zeta_{it})$ capture ability heterogeneity that is unobservable to the econometrician.
We assume that

$$\varepsilon_{it} \sim N(0, \sigma_\varepsilon^2); \quad \zeta_{it} \sim N(0, \sigma_\zeta^2) \tag{18}$$

$$\Pr(\nu_i = \nu_m; \alpha_i = \alpha_m) = \pi_m, \quad \sum_{m=1}^{M} \pi_m = 1. \tag{19}$$

Hence, we assume that type heterogeneity is discretely distributed with $M$ types. As a result, the model will feature a mix of discrete and continuous random variables and is thus a structural mixture model, where the $\pi_m$ are the mixing probabilities. The number of types $M$ needs to be set empirically, although it is hard to test for the optimal number of types. Likelihood ratio tests that compare the likelihood between two models that only differ with respect to the number of types are generally biased. An unbiased test has been derived only recently by Kasahara and Shimotsu (2016). However, it is common to just set $M$ exogenously.

But why are we assuming that $(\nu_i, \alpha_i)$ have a discrete distribution? Since this is a non-linear model, individual fixed effects cannot be estimated consistently, and they cannot be eliminated by first differencing as in a linear model. This leaves us with a random effects model. The Dynamic Programming below will force us to discretize the support of the distribution of $(\nu_i, \alpha_i)$ since otherwise we may have to compute the value functions an uncountably infinite number of times - once for each possible $(\nu_i, \alpha_i)$. The alternative is to assume $(\nu_i, \alpha_i)$ to be discrete random variables. This also has the advantage that we are not imposing any functional form assumptions on their distribution: the $\pi_m$ will be estimated with only one restriction - they need to sum up to one.

To solve the individual's decision problem, suppose there is a compulsory schooling level of $s$ years. Up to this level, there is no decision to be made. Subsequently, the individual needs to choose whether to stay in school or not, anticipating that once they leave they will not return. This is therefore an optimal stopping problem. We assume that the individual maximizes expected life-cycle utility $E \sum_{t=0}^{T} \beta^t U(C, t)$, where $\beta$ now denotes the discount factor (not the return to education of Section 2.1). At age $t$ the state of an individual is given by $\Omega_{it} = (S_i, X_i, \alpha_i, \nu_i, \varepsilon_{it}, \zeta_{it}) \equiv (S_i, \tilde{\Omega}_{it})$. The value of exiting given the state is

$$V_t^e(S, \tilde{\Omega}) = \exp\{f(S, \theta) + g(ex, \gamma) + \alpha + \varepsilon\} + \beta E\left[V_{t+1}^e(S, \tilde{\Omega}_{+1})\right], \tag{20}$$

with $ex_{it} = t - S = 0$ upon exit, where $V_{t+1}^e(S, \tilde{\Omega}_{+1})$ is defined analogously. This is just the expected discounted value of the future earnings stream given education $S$. Notice that we do not index the state by $i$. This is because the value functions are defined on a state space that does not depend on the sample. That is, given a value for the parameters we can solve the DP without knowing the sample. Only in constructing the likelihood do we need to evaluate the value function at the observed state for each individual. Make sure you understand this distinction.
The value of remaining in school is given by

$$V_t^s(S, \tilde{\Omega}) = \exp\{X \delta + \zeta\} - \nu + \beta E\left[V_{t+1}(S + 1, \tilde{\Omega}_{+1})\right]. \tag{21}$$

Note that for those who have stayed in school up to period $t$, we have $S = t$, so one of the two indices of this value function is redundant and can be dropped. The Bellman equation is then

$$V_t(S, \tilde{\Omega}) = \max\left\{V_t^e(S, \tilde{\Omega}),\, V_t^s(S, \tilde{\Omega})\right\}.$$

Although the decision problem is quite trivial, the DP is formidable: the state space $(\Omega_{it})_t$ is so large that conventional methods will be numerically infeasible unless one uses supercomputers. This is what can make working with DDC-models frustrating: as a labor economist, one wants to introduce as much observed and unobserved heterogeneity as possible, but numerically the problem becomes intractable quite quickly. One additional problem is that some of the state variables are unobservable. In practice, most researchers solve the value functions on a small subset of the state space and then interpolate between the state points using linear or non-linear regressions.
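To make the recursion behind the Bellman equation concrete, here is a minimal sketch of backward induction for a single type $(\alpha, \nu)$. It is a toy version under several assumptions that are not in the notes: $f$ and $g$ are given hypothetical functional forms and parameter values, $X\delta$ is a constant, there is no compulsory schooling period, and the expectation over $(\varepsilon, \zeta)$ is approximated with a fixed set of simulated draws rather than computed exactly. The function name `solve_school_exit` is made up for this illustration.

```python
import math
import random


def solve_school_exit(alpha, nu, T=20, beta=0.95, sigma_eps=0.3,
                      sigma_zeta=0.3, x_delta=1.0, n_draws=500, seed=0):
    """Backward induction for one type; returns (EV, stay_prob) by age t."""

    def f(S):                       # hypothetical f(S, theta): linear in schooling
        return 0.10 * S

    def g(ex):                      # hypothetical g(ex, gamma): quadratic in experience
        return 0.04 * ex - 0.001 * ex ** 2

    rng = random.Random(seed)
    draws = [(rng.gauss(0.0, sigma_eps), rng.gauss(0.0, sigma_zeta))
             for _ in range(n_draws)]
    mean_exp_eps = math.exp(0.5 * sigma_eps ** 2)   # E[exp(eps)] for normal eps

    def expected_work_value(t, S):
        # Expected discounted earnings from age t to T with schooling S.
        return sum(beta ** (j - t) * math.exp(f(S) + g(j - S) + alpha) * mean_exp_eps
                   for j in range(t, T + 1))

    EV = [0.0] * (T + 2)            # EV[t]: expected value of reaching age t in school (S = t)
    stay_prob = [0.0] * (T + 1)
    for t in range(T, -1, -1):      # backward induction from the last period
        cont_exit = beta * expected_work_value(t + 1, t)
        cont_stay = beta * EV[t + 1]
        total, stays = 0.0, 0
        for eps, zeta in draws:
            v_exit = math.exp(f(t) + g(0) + alpha + eps) + cont_exit   # analogue of (20)
            v_stay = math.exp(x_delta + zeta) - nu + cont_stay         # analogue of (21)
            total += max(v_exit, v_stay)                               # Bellman equation
            stays += v_stay >= v_exit
        EV[t] = total / n_draws
        stay_prob[t] = stays / n_draws
    return EV, stay_prob


# Two hypothetical types that differ only in the disutility of schooling.
EV_low, stay_low = solve_school_exit(alpha=2.0, nu=0.0)
EV_high, stay_high = solve_school_exit(alpha=2.0, nu=2.0)
print([round(p, 2) for p in stay_low])
print([round(p, 2) for p in stay_high])
```

Because the two calls share the same simulated draws, the type with the higher disutility of schooling has a weakly lower probability of staying at every age. In the full model this solution step would be repeated for each of the $M$ types and for every trial value of the structural parameters.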

Some of the assumptions above make the model fairly tractable econometrically. First, we have assumed that unobserved heterogeneity is distributed according to a discrete mixture with $M$ types. Hence, the DP needs to be solved separately for each of the types, that is, $M$ times. This is an extreme way of discretizing a potentially continuous distribution. Usually, one chooses a small $M$, e.g. $M = 4$. Second, the only additional unobserved components are the $(\varepsilon_{it}, \zeta_{it})$, which are assumed to be iid across individuals and time. The latter assumption implies that, conditional on all other state variables, one cannot improve forecasts about future income by conditioning on $(\varepsilon_{it}, \zeta_{it})$. Furthermore, $(\varepsilon_{it}, \zeta_{it})$ do not affect the returns to schooling or the schooling level directly. Combining these two assumptions implies that $E[V_{t+1}(S + 1, \tilde{\Omega}_{+1})]$ does not need to condition on $(\varepsilon, \zeta)$. As we will see in a moment, this is a particularly powerful result.

To highlight that it reduces the space over which the expectation function $E[V_{t+1}(S + 1, \tilde{\Omega}_{+1})]$ needs to be computed, we define $\tilde{\Omega} = (\bar{\Omega}, \varepsilon, \zeta)$. Now suppose that we condition on $\bar{\Omega}$, essentially treating it as non-random. Then the expectation function only depends on these non-random conditioning variables, which is much simpler compared to the situation where $\bar{\Omega}$ contains random variables that are unobserved to the econometrician. Unfortunately, $(\nu, \alpha) \in \bar{\Omega}$, and they are random and unobservable. However, for now we keep the assumption that they are observable.

The observable endogenous variables in the data for individual $i$ are a sequence of observed earnings and the level of schooling, $(W_i(t, S_i), S_i)_t = (W_{i1}, W_{i2}, \dots, W_{iS}, W_{iS+1}, \dots, W_{iT}, S_i)$. In contrast to the EFK-paper on intertemporal labor supply, the likelihood can be characterized, though it will need to be simulated. Start with the conditional choice probability of dropping out of school after $S$ years.
Conditional on $(\bar{\Omega}_{i\tau})_{\tau \le t}$, individuals obtain schooling level $S_i$ if they permanently leave school in $t = S_i$, which implies that they have decided to stay in all previous periods. Conditional on the observed earnings and the $\bar{\Omega}_{i\tau}$ up to $t$, this event has probability

$$\Pr\left(S_i \mid W_i(\tau, S_i), \bar{\Omega}_{i\tau}, \tau \in \{1, 2, \dots, t\}\right) \tag{22}$$

$$= \Pr\begin{pmatrix} W_i^s(\tau) - \nu_i + \beta E[V_{\tau+1}(\tau + 1, \tilde{\Omega}_{+1})] \ge V_\tau^e(\tau, \tilde{\Omega}_{i\tau}), & \tau < S_i; \\ V_\tau^s(\tau, \tilde{\Omega}_{i\tau}) < W_i(\tau, S) + \beta E[V_{\tau+1}^e(\tau, \tilde{\Omega}_{+1})], & \tau = S_i \end{pmatrix} \tag{23}$$

$$= \Pr\begin{pmatrix} W_i^s(\tau) - \nu_i + \beta E[V_{\tau+1}(\tau + 1, \tilde{\Omega}_{+1})] - \beta E[V_{\tau+1}^e(\tau, \tilde{\Omega}_{+1})] \ge \exp\{f(\tau, \theta) + g(ex, \gamma) + \alpha_i + \varepsilon_{i\tau}\}, & \tau < S_i; \\ W_i(\tau, S) + \beta E[V_{\tau+1}^e(\tau, \tilde{\Omega}_{+1})] - \beta E[V_{\tau+1}(\tau + 1, \tilde{\Omega}_{+1})] > \exp\{X_i \delta + \zeta_{i\tau}\} - \nu_i, & \tau = S_i \end{pmatrix} \tag{24}$$

Notice that we are now evaluating the value functions at the sample, so we are using indices here. There are several properties of this probability worth noting. First, it does not depend on wages after $\tau = S_i$: these wages need to be forecasted at the time the schooling decision has to be made, and the information set consists of $\bar{\Omega}$. Second, the probability conditions on schooling wages for all periods $\tau < S_i$ and on labor market wages in period $\tau = S_i$. Third, the only random variables are the $(\varepsilon_{it}, \zeta_{it})$, which are iid across individuals and over time. Because of the functional form assumptions, we can isolate them on one side of the inequalities and then use their independence to factor the probabilities:

$$\Pr\left(S_i \mid W_i(\tau, S_i), \bar{\Omega}_{i\tau}, \tau \in \{1, 2, \dots, t\}\right)$$

$$= \Pr\begin{pmatrix} \ln\left(W_i^s(\tau) - \nu_i + \beta E[V_{\tau+1}(\tau + 1, \tilde{\Omega}_{+1})] - \beta E[V_{\tau+1}^e(\tau, \tilde{\Omega}_{+1})]\right) - \left[f(\tau, \theta) + g(ex, \gamma) + \alpha_i\right] \ge \varepsilon_{i\tau}, & \tau < S_i; \\ \ln\left(W_i(\tau, S) + \beta E[V_{\tau+1}^e(\tau, \tilde{\Omega}_{+1})] - \beta E[V_{\tau+1}(\tau + 1, \tilde{\Omega}_{+1})] + \nu_i\right) - X_i \delta > \zeta_{i\tau}, & \tau = S_i \end{pmatrix}$$

$$= \Phi\left[\frac{1}{\sigma_\zeta}\left(\ln(\cdot) - X_i \delta\right)\right] \prod_{\tau=0}^{S_i - 1} \Phi\left[\frac{1}{\sigma_\varepsilon}\left(\ln(\cdot) - \left[f(\tau, \theta) + g(ex, \gamma) + \alpha_i\right]\right)\right], \tag{25}$$

where $\ln(\cdot)$ abbreviates the log term in the corresponding inequality. The remarkable fact is that the choice probabilities have a closed form conditional on $\bar{\Omega}$ and after having solved the Dynamic Programming Problem numerically. Essentially, conditional on the value functions this is the likelihood of a sequence of static Probit models.

There are three essential assumptions that yield the convenient expression for the choice probabilities. By convenient I mean that one can separate the inequalities characterizing optimality in such a way that the unobserved heterogeneity can be written on one side, and the rest of the variables, including future expected values, on the other side. These three assumptions are:

1. Future states $\bar{\Omega}_{it}$ are conditionally independent of $(\varepsilon_{it}, \zeta_{it})$.
2. $(\varepsilon_{it}, \zeta_{it})$ are iid across workers and time.
3. Additive log-separability of $(\varepsilon_{it}, \zeta_{it})$ in the payoff equations.

The first two assumptions imply that future values cannot be predicted using the current random state. This means that conditional on $\bar{\Omega}$ the expected continuation values are deterministic functions rather than random variables.
We already accounted for that by dropping $(\varepsilon_{it}, \zeta_{it})$ from the state space of these expectations. Assumption 1 holds in the model because none of the observed state variables depends on $(\varepsilon_{it}, \zeta_{it})$. The third assumption uses the separability of the unobserved term to separate it from the deterministic functions. To read more about the importance of these assumptions, see the review papers on Dynamic Discrete Choice models by Aguirregabiria and Mira in the Journal of Econometrics and by Keane, Todd and Wolpin in the Handbook of Labor Economics.

We need to continue with deriving the likelihood. The choice probability above summarizes all the endogeneity of schooling. The likelihood of the history of observed schooling earnings is just the product of the conditional earnings densities up to period $(S_i - 1)$, where we condition on type $m$ and the observables:

$$l_{i1}\left((\bar{\Omega}_{it})_{t=0}^{S_i - 1}, m\right) = \prod_{t=0}^{S_i - 1} g^s\left(W_i^s(t) \mid \bar{\Omega}_{it}\right).$$

Here, $g^s(\cdot)$ is the density of residual schooling earnings. Similarly, from period $(S_i + 1)$ onwards, the likelihood is also just the product of conditional labor market earnings densities:

$$l_{i3}\left((\bar{\Omega}_{it})_{t=S_i + 1}^{T}, m\right) = \prod_{t=S_i + 1}^{T} g^e\left(W_i^e(t, S_i) \mid \bar{\Omega}_{it}\right). \tag{26}$$

In period $t = S_i$, we observe both the individual leaving school and labor market earnings:

$$l_{i2}\left(\bar{\Omega}_{iS_i}, m\right) = \Pr(S_i \mid \cdot)\; g^e\left(W_i^e(S_i, S_i) \mid \bar{\Omega}_{iS_i}\right). \tag{27}$$

For the rest of an individual's observed history, the only endogenous variables are wages, which have period-$t$ density $g^e(W_i^e(t, S_i) \mid \bar{\Omega}_{it})$. Conditional on types, the likelihood contribution of an individual is given by

$$l_i(m) = l_{i1} \cdot l_{i2} \cdot l_{i3}. \tag{28}$$

Notice that the expressions are fairly convenient. Important here is the trick of conditioning the choice probabilities on observed payoffs rather than vice versa, so that the density of payoffs shows up treating $S$ and $ex$ as exogenous. We have used this before when writing down the likelihood for the Roy model.

The remaining problem is that we are conditioning on the type, even though it is an unobserved state variable. We therefore need to integrate it out. Given the discreteness of the distribution over types, this is straightforward: this is a discrete version of a random effects model. Hence, the integration over unobserved heterogeneity to get an individual's actual likelihood contribution is simply a sum:

$$L_i = \sum_{m=1}^{M} \pi_m\, l_i(m). \tag{29}$$

This implies that for every individual we need to compute the likelihood contribution $M$ times and then compute the weighted sum, with the $\pi_m$ as weights.
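The weighted sum in (29) is simple, but in practice the type-specific contributions $l_i(m)$ are products of many densities and probabilities and can underflow. A common way to handle this, shown here as a sketch and not as the procedure used by Belzil and Hansen, is to work with $\log l_i(m)$ and apply the log-sum-exp trick. The function names are hypothetical, and the softmax parameterization of the $\pi_m$ is one assumed way of imposing that the mixing probabilities are positive and sum to one.

```python
import math


def mixing_probabilities(logits):
    """Map unrestricted numbers into probabilities pi_m that sum to one (softmax)."""
    top = max(logits)
    weights = [math.exp(x - top) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


def individual_log_likelihood(log_l_by_type, pi):
    """log L_i = log sum_m pi_m * l_i(m), computed stably from log l_i(m)."""
    terms = [math.log(p) + ll for p, ll in zip(pi, log_l_by_type) if p > 0.0]
    top = max(terms)
    return top + math.log(sum(math.exp(x - top) for x in terms))


def sample_log_likelihood(log_l, pi):
    """Sum of log L_i over individuals; log_l[i][m] holds log l_i(m)."""
    return sum(individual_log_likelihood(row, pi) for row in log_l)


# Toy input with M = 2 types and two individuals (numbers are made up).
pi = mixing_probabilities([0.0, 0.0])            # equal weights on both types
log_l = [
    [math.log(0.2), math.log(0.4)],              # L_1 = 0.5 * 0.2 + 0.5 * 0.4
    [-1000.0, -1001.0],                          # would underflow without the trick
]
print(individual_log_likelihood(log_l[0], pi))
print(sample_log_likelihood(log_l, pi))
```

The second individual illustrates why the logs matter: exponentiating $-1000$ directly returns zero in floating point, whereas the stabilized sum stays finite.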
Since the $\pi_m$ are left unrestricted (other than $\sum_{m=1}^{M} \pi_m = 1$), this is also often referred to as Non-parametric Maximum Likelihood.

How do we estimate the model? We just need to modify the algorithm we have used in the Minimum Distance Estimation at the end of the section on labor demand. The resulting Nested Fixed Point Algorithm works as follows:

1. Guess an initial value for all structural parameters of the model.
2. Inner Loop:
   (a) Solve the Dynamic Programming Problem using backward induction. This needs to be performed $M$ times, once for each type.
   (b) Compute the choice probabilities. The expected future values are treated as an additional observable variable since they are a deterministic function of the observable state variables and the type $m$.
   (c) Evaluate the log-income density given the parameter values and the observed state, including the type.
   (d) Compute $l_i(m)$ and then $L_i$ by integrating out the types.
3. Outer Loop: Maximize the likelihood with respect to the structural parameters by repeating steps 1 and 2 until convergence, updating the guess in step 1 each time step 2 has been completed. For the numerical update, one may use a derivative-based Newton-Raphson type algorithm if the choice probabilities are differentiable in the structural parameters. This is the case here (it would not be if we simulated choice probabilities using a crude-frequency simulator).

The results of the paper are interesting. First, the function $f(S_i, \theta)$ is highly non-linear and convex in schooling. Returns to education are actually statistically insignificant up to grade 12, implying that the timing of a high-school dropout does not matter for wages. They are largest at grade 14. As a consequence, average returns can never be equal to local returns to education. Second, average returns estimated from the structural model are at most 7%, compared with an OLS estimate of about 10%. Hence, the structural estimates are significantly smaller than reduced-form estimates. This implies that there are significant ability biases (hence, OLS is biased) and that the differences between LATE and ATE are large (hence, the IV estimates are large). Third, unobserved heterogeneity is important for explaining both wage differentials across individuals and choices. In particular, those with low disutility of schooling ($\nu_i$ is small, or even negative) have high market abilities ($\alpha_i$ is large). Hence, those who tend to take more education are also those who have above-average wages no matter the schooling level. This is the source of ability bias in reduced-form estimates. Given the convexities in $f(S_i, \theta)$, this also implies that those with high market abilities have high local returns.