Panel Data Model

Panel data are data collected on a cross section of units that are then observed repeatedly over time. Examples include the economic growth of each province in Indonesia from 1971 to 2009, or the profits of companies listed on the ISX observed from 1991 to 2009.

Panel data are very useful for researchers who want to analyze questions that cannot be answered with time-series data or cross-section data alone. For example, suppose we would like to develop a model that explains the variation in regional economic performance across provinces in Indonesia through their natural resources and the productivity of their human resources. If we estimate the model using cross-section data observed in only one particular year, we cannot say anything about the variation in their growth over the last ten years.

By using panel data, researchers can analyze the fluctuations of economic performance over the years as well as the variation in economic performance across provinces in a particular year. Since panel data combine cross-section and time-series data, the number of observations is large. In addition, the characteristics of time-series data and cross-section data are both present in panel data. This can be an advantage or a disadvantage for researchers, which is why estimating a panel data model needs special treatment.

Model Representation

Model with cross-section data:
$$Y_i = \alpha + \beta X_i + \varepsilon_i; \quad i = 1, 2, \dots, N$$
N: number of cross-section observations

Model with time-series data:
$$Y_t = \alpha + \beta X_t + \varepsilon_t; \quad t = 1, 2, \dots, T$$
T: number of time-series observations

Model with panel data:
$$Y_{it} = \alpha + \beta X_{it} + \varepsilon_{it}; \quad i = 1, 2, \dots, N; \; t = 1, 2, \dots, T$$
N·T: number of panel data observations
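The N·T observation count comes from stacking every individual i against every period t. A minimal sketch of that stacked ("long") layout, assuming a small hypothetical balanced panel with N = 3 and T = 4 (the names `i_index` and `t_index` are illustrative, not from the slides):

```python
import numpy as np

N, T = 3, 4  # hypothetical sizes: 3 individuals observed over 4 periods

# Long layout: one row per (i, t) pair, with the individual index varying slowest.
i_index = np.repeat(np.arange(1, N + 1), T)  # 1,1,1,1,2,2,2,2,3,3,3,3
t_index = np.tile(np.arange(1, T + 1), N)    # 1,2,3,4,1,2,3,4,1,2,3,4

n_obs = len(i_index)  # number of panel observations, equal to N * T
print(n_obs)
```

Each row of such a layout corresponds to one pair (i, t), so a variable such as Y_it is stored as a single vector of length N·T.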

Estimation of Panel Data Model

There are several techniques available:

1. OLS (Pooled Data)
This technique is used when the cross-section and time-series data are simply combined, and the combined data set (pooled data) is treated as a new data set without taking cross-section or time-series behavior into account.

2. Fixed Effect
This approach assumes that all individual-specific and period-specific characteristics are captured in the intercepts. Therefore, in this approach, the intercept can change across individuals, over time, or in both directions.

3. Random Effect
This approach assumes that all individual-specific and period-specific characteristics are captured in the residuals. Therefore, in this approach, the residual has an individual component, a time component, and a combined component.

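Observe the following representation:

$$Y_{it} = \alpha + \beta X_{it} + \varepsilon_{it}; \quad i = 1, 2, \dots, N; \; t = 1, 2, \dots, T$$

If $\operatorname{cov}(\varepsilon_{it}, \varepsilon_{jt}) = 0$ for $i \neq j$, $\operatorname{cov}(\varepsilon_{it}, \varepsilon_{i,t-1}) = 0$, $E(\varepsilon_{it}) = 0$, and $\operatorname{Var}(\varepsilon_{it}) = \sigma^2$, then we can estimate the model by separating its time component, so that we have T regressions, each with N observations:

$$Y_{i1} = \alpha + \beta X_{i1} + \varepsilon_{i1}; \quad i = 1, 2, \dots, N$$
$$Y_{i2} = \alpha + \beta X_{i2} + \varepsilon_{i2}$$
$$\vdots$$
$$Y_{iT} = \alpha + \beta X_{iT} + \varepsilon_{iT}$$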

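Analogously, we can estimate the model by separating its cross-section component, so that we have N regressions, each with T observations:

$$i = 1: \quad Y_{1t} = \alpha + \beta X_{1t} + \varepsilon_{1t}; \quad t = 1, 2, \dots, T$$
$$i = 2: \quad Y_{2t} = \alpha + \beta X_{2t} + \varepsilon_{2t}$$
$$\vdots$$
$$i = N: \quad Y_{Nt} = \alpha + \beta X_{Nt} + \varepsilon_{Nt}$$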

For the pooled data approach, we assume that the intercept α is constant and that the residuals have no individual-specific or period-specific component. Sometimes this assumption is not realistic. Therefore, we will consider models that let the intercepts or the residuals change over time and across individuals.
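As an illustration of the pooled approach, the sketch below regresses the stacked Y on a constant and the stacked X, ignoring which individual or period each row belongs to. The data are simulated and noise-free, and the function name `pooled_ols` is hypothetical; this is only one way to compute the estimates, using NumPy's least-squares solver.

```python
import numpy as np

def pooled_ols(y, x):
    """Regress y on a constant and x, ignoring the panel structure.

    Returns the pair (alpha_hat, beta_hat).
    """
    X = np.column_stack([np.ones(len(x)), x])      # columns: constant, X_it
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares solution
    return coef[0], coef[1]

# Hypothetical balanced panel: N = 5 individuals, T = 6 periods, stacked.
N, T = 5, 6
rng = np.random.default_rng(0)
x = rng.normal(size=N * T)
y = 2.0 + 0.5 * x                                  # assumed true alpha = 2, beta = 0.5

alpha_hat, beta_hat = pooled_ols(y, x)
print(alpha_hat, beta_hat)
```

Because a single α and a single β are fitted to all N·T rows, any systematic difference between individuals or periods ends up in the residuals.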

Fixed Effect Model (FEM)

In this model, variation across individuals and over time is captured in the intercepts. It is formulated as follows:

$$Y_{it} = \alpha + \gamma_2 W_{2t} + \gamma_3 W_{3t} + \dots + \gamma_N W_{Nt} + \delta_2 Z_{i2} + \delta_3 Z_{i3} + \dots + \delta_T Z_{iT} + \beta X_{it} + \varepsilon_{it}$$

$W_{it}$ and $Z_{it}$ are dummy variables, defined as:
$W_{it}$ = 1 for individual i; i = 1, 2, ..., N
$W_{it}$ = 0 otherwise
$Z_{it}$ = 1 for period t; t = 1, 2, ..., T
$Z_{it}$ = 0 otherwise

Individual 1 and period 1 serve as the base category, so only the dummies for i = 2, ..., N and t = 2, ..., T enter the model. If the model is estimated using OLS, we obtain an unbiased and consistent estimator.

Remarks:
1. The model has N + T parameters, consisting of:
   (N − 1) parameters γ
   (T − 1) parameters δ
   1 parameter α
   1 parameter β
2. The degrees of freedom are: NT − N − T

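Regression Equations on FEM

i = 1:
t = 1: $Y_{11} = \alpha + \beta X_{11} + \varepsilon_{11}$
t = 2: $Y_{12} = (\alpha + \delta_2) + \beta X_{12} + \varepsilon_{12}$
...
t = T: $Y_{1T} = (\alpha + \delta_T) + \beta X_{1T} + \varepsilon_{1T}$

i = 2:
t = 1: $Y_{21} = (\alpha + \gamma_2) + \beta X_{21} + \varepsilon_{21}$
t = 2: $Y_{22} = (\alpha + \gamma_2 + \delta_2) + \beta X_{22} + \varepsilon_{22}$
...
t = T: $Y_{2T} = (\alpha + \gamma_2 + \delta_T) + \beta X_{2T} + \varepsilon_{2T}$

...

i = N:
t = 1: $Y_{N1} = (\alpha + \gamma_N) + \beta X_{N1} + \varepsilon_{N1}$
t = 2: $Y_{N2} = (\alpha + \gamma_N + \delta_2) + \beta X_{N2} + \varepsilon_{N2}$
...
t = T: $Y_{NT} = (\alpha + \gamma_N + \delta_T) + \beta X_{NT} + \varepsilon_{NT}$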
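The dummy-variable form of FEM can be sketched as an ordinary least-squares fit on a design matrix that holds a constant, N − 1 individual dummies, T − 1 period dummies, and X. The example below uses simulated, noise-free data with hypothetical true values; the function name `fem_design` and all numbers are assumptions made for illustration.

```python
import numpy as np

def fem_design(i_index, t_index, x):
    """Build [1, W_2..W_N, Z_2..Z_T, x] for the two-way dummy-variable FEM.

    Individual 1 and period 1 are the base category, so they get no dummy.
    """
    ids = np.unique(i_index)
    periods = np.unique(t_index)
    W = (i_index[:, None] == ids[None, 1:]).astype(float)      # N - 1 columns
    Z = (t_index[:, None] == periods[None, 1:]).astype(float)  # T - 1 columns
    return np.column_stack([np.ones(len(x)), W, Z, x])

# Hypothetical balanced panel: N = 4 individuals, T = 10 periods.
N, T = 4, 10
i_index = np.repeat(np.arange(1, N + 1), T)
t_index = np.tile(np.arange(1, T + 1), N)
rng = np.random.default_rng(1)
x = rng.normal(size=N * T)

gamma = np.array([0.0, 1.0, -2.0, 0.5])  # assumed intercept shifts, gamma_1 = 0
delta = np.linspace(0.0, 0.9, T)         # assumed period shifts, delta_1 = 0
y = 3.0 + gamma[i_index - 1] + delta[t_index - 1] + 1.5 * x  # alpha = 3, beta = 1.5

X = fem_design(i_index, t_index, x)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

n_params = X.shape[1]     # N + T parameters in total
dof = N * T - n_params    # NT - N - T degrees of freedom
print(n_params, dof)
```

The first coefficient is the intercept of the base category (individual 1 in period 1), and the last is the common slope β; the entries in between are the shifts γ and δ.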

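To investigate whether α is constant for all i and t, perform the following test:

$$F = \frac{RSS_{OLS} - RSS_{FEM}}{RSS_{FEM}} \cdot \frac{NT - N - T}{N + T - 2}$$

If the calculated F exceeds the F value from the table with (N + T − 2, NT − N − T) degrees of freedom, then $H_0$ (α is constant for all i and t) is rejected, which means that FEM is better. The next question is: how do we interpret all the parameters?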
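The test statistic is a direct function of the two residual sums of squares and the panel dimensions. A minimal sketch, in which the function name `fem_f_statistic` and the RSS values are hypothetical and chosen only to show the arithmetic:

```python
def fem_f_statistic(rss_ols, rss_fem, n, t):
    """F statistic comparing pooled OLS (restricted) with FEM (unrestricted).

    n is the number of individuals, t the number of periods.
    """
    df_num = n + t - 2        # number of dummy coefficients being tested
    df_den = n * t - n - t    # residual degrees of freedom of the FEM
    return ((rss_ols - rss_fem) / rss_fem) * (df_den / df_num)

# Hypothetical residual sums of squares for a panel with N = 4 and T = 10.
f_value = fem_f_statistic(rss_ols=120.0, rss_fem=60.0, n=4, t=10)
print(f_value)
```

The resulting value would then be compared with the tabulated F critical value for (N + T − 2, NT − N − T) degrees of freedom.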

Random Effect Model (REM)

In FEM, variation across individuals and over time is accommodated in the intercepts, so the intercepts change over time and across individuals. In REM, by contrast, this variation is accommodated in the residuals. The random error is decomposed into an individual component, a time component, and a combined component. REM can be represented as:

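$$Y_{it} = \alpha + \beta X_{it} + \varepsilon_{it}; \quad \varepsilon_{it} = u_i + v_t + w_{it}$$

$u_i$: error for the cross-section component
$v_t$: error for the time-series component
$w_{it}$: error for both

With the assumptions:
$$u_i \sim N(0, \sigma_u^2); \quad v_t \sim N(0, \sigma_v^2); \quad w_{it} \sim N(0, \sigma_w^2)$$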

Therefore, on average, the deviation effect over time is represented randomly by $v_t$, while the deviation effect across the cross section is represented randomly by $u_i$.

For REM: $\operatorname{Var}(\varepsilon_{it}) = \sigma_u^2 + \sigma_v^2 + \sigma_w^2$
For OLS (pooled data): $\operatorname{Var}(\varepsilon_{it}) = \sigma_w^2$
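The REM variance can be checked with a short derivation. It assumes that the three error components are mutually uncorrelated, which the text does not state explicitly but which the stated result requires:

```latex
\begin{aligned}
\operatorname{Var}(\varepsilon_{it})
  &= \operatorname{Var}(u_i + v_t + w_{it}) \\
  &= \operatorname{Var}(u_i) + \operatorname{Var}(v_t) + \operatorname{Var}(w_{it})
     + 2\operatorname{Cov}(u_i, v_t) + 2\operatorname{Cov}(u_i, w_{it})
     + 2\operatorname{Cov}(v_t, w_{it}) \\
  &= \sigma_u^2 + \sigma_v^2 + \sigma_w^2
     \qquad \text{(all covariances are zero by assumption).}
\end{aligned}
```

Setting $\sigma_u^2 = \sigma_v^2 = 0$ leaves only $\sigma_w^2$, which is the pooled OLS case.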

So, REM can be estimated using OLS if $\sigma_u^2 = \sigma_v^2 = 0$. Otherwise, REM is estimated using the Generalized Least Squares (GLS) method, which consists of two stages:

Stage I:
(i) Estimate REM using OLS.
(ii) Calculate the RSS to estimate the sample variance.

Stage II:
Using the sample variance estimated in the first stage, use GLS to estimate the parameters of the model.
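The second stage can be sketched for a simplified case. The code below is an illustration only, under several assumptions that go beyond the slides: a balanced panel, a one-way model with only the individual component (so $\sigma_v^2 = 0$), and variance components that are already available from Stage I and passed in as arguments. Under those assumptions GLS is equivalent to OLS on quasi-demeaned data, with weight θ = 1 − sqrt(σ_w² / (σ_w² + T σ_u²)). The function name `re_gls_one_way` is hypothetical.

```python
import numpy as np

def re_gls_one_way(y, x, i_index, sigma_u2, sigma_w2):
    """Stage II for a balanced one-way REM: quasi-demean, then run OLS.

    sigma_u2 and sigma_w2 are the variance components assumed to come
    from Stage I. Returns the array [alpha_hat, beta_hat].
    """
    ids, counts = np.unique(i_index, return_counts=True)
    T = counts[0]  # balanced panel assumed: every individual has T rows
    theta = 1.0 - np.sqrt(sigma_w2 / (sigma_w2 + T * sigma_u2))

    # Individual means, broadcast back to every observation.
    pos = np.searchsorted(ids, i_index)
    y_bar = (np.bincount(pos, weights=y) / counts)[pos]
    x_bar = (np.bincount(pos, weights=x) / counts)[pos]

    # Subtract a fraction theta of the individual mean from each variable.
    y_star = y - theta * y_bar
    X_star = np.column_stack([np.full(len(y), 1.0 - theta), x - theta * x_bar])
    coef, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)
    return coef

# Hypothetical noise-free data: N = 5 individuals, T = 6 periods.
N, T = 5, 6
i_index = np.repeat(np.arange(1, N + 1), T)
rng = np.random.default_rng(2)
x = rng.normal(size=N * T)
y = 2.0 + 0.5 * x  # assumed true alpha = 2, beta = 0.5

coef_gls = re_gls_one_way(y, x, i_index, sigma_u2=1.0, sigma_w2=1.0)
coef_pooled = re_gls_one_way(y, x, i_index, sigma_u2=0.0, sigma_w2=1.0)
print(coef_gls, coef_pooled)
```

With `sigma_u2 = 0` the weight θ is zero and the transformation does nothing, so the estimator reduces to pooled OLS, in line with the condition stated above.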

Remark: If we can assume that the error is normally distributed, then MLE can be used.

FEM vs REM

Which one should we choose?
(i) REM has fewer parameters, so it has more degrees of freedom. FEM, however, can distinguish individual effects from time effects.
(ii) One suggestion: if T > N, use FEM; if N > T, use REM.
(iii) Use a statistical test instead.

Example 1

To analyze the cost function of an industry, the costs and outputs of 4 companies were observed over a ten-year period. The cost function is estimated using the FEM approach:

$$C_{it} = \alpha + \gamma_2 W_{2t} + \gamma_3 W_{3t} + \gamma_4 W_{4t} + \beta Q_{it} + \varepsilon_{it}$$

$C_{it}$: total cost of company i at time t
$Q_{it}$: total output of company i at time t
$W_{it}$ = 1 for company i; i = 2, 3, 4
$W_{it}$ = 0 otherwise

The estimated model:

$$\hat{C}_{it} = 2.315 + 10.110\, W_{2t} + 2.385\, W_{3t} + 16.171\, W_{4t} + 1.119\, Q_{it}$$

Comment: How to interpret the intercept?
For company 1, if $Q_1 = 1000$, then $C_1 = 2.315 + 1.119 \times 1000 = 1121.315$
For company 2, if $Q_2 = 1000$, then $C_2 = C_1 + 10.110$
For company 3, if $Q_3 = 1000$, then $C_3 = C_1 + 2.385$
For company 4, if $Q_4 = 1000$, then $C_4 = C_1 + 16.171$

How to interpret the slope? If output increases by 1 unit, cost increases by 1.119 units for each of companies 1, 2, 3, and 4.

Which company is the most cost efficient?
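The intercept interpretation can be reproduced numerically from the estimated coefficients of Example 1. The sketch below simply evaluates the fitted equation at an output of 1000 for each company; the function name `predicted_cost` is illustrative.

```python
# Estimated coefficients from Example 1; company 1 is the base category.
ALPHA = 2.315
GAMMA = {1: 0.0, 2: 10.110, 3: 2.385, 4: 16.171}
BETA = 1.119

def predicted_cost(company, output):
    """Fitted cost: common intercept, company-specific shift, common slope."""
    return ALPHA + GAMMA[company] + BETA * output

costs = {c: round(predicted_cost(c, 1000), 3) for c in (1, 2, 3, 4)}
cheapest = min(costs, key=costs.get)
print(costs, cheapest)
```

Since the slope is common to all four companies, the ranking of predicted costs at any given output level is determined entirely by the intercept shifts.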

Example 2

Relationship between R&D budget and number of products patented. Many companies spend a lot of money on Research and Development (R&D), expecting that new, more efficient innovations or techniques will be invented. To investigate whether there is a positive relationship between the R&D budget and the number of patents, 45 companies in the US were observed over 7 years.

P: number of inventions patented (in logs)
RND: R&D budget 5 years earlier (in logs)

The proposed model:
$$P_{it} = \beta_0 + \beta_1 RND_{i,t-5} + \varepsilon_{it}; \quad i: \text{company}; \; t: \text{time}$$

Using 315 observations (45 companies over 7 years):
$$\hat{P}_{it} = 1.438 + 0.845\, RND_{i,t-5}$$
t: (14.01) (24.17); $R^2 = 0.65$

Observations:
1. The estimated model indicates a positive relationship between the R&D budget and the number of inventions patented.
2. On average, for every 1% increase in R&D spending, the number of inventions patented increases by 0.845%.
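The percentage reading follows because both variables are measured in logs, so the slope is an elasticity. A short worked version, writing $N^{pat}$ and $B$ for the unlogged patent count and R&D budget (symbols introduced here for illustration only):

```latex
\begin{aligned}
P = \ln N^{pat}, \quad RND = \ln B
  \;&\Longrightarrow\;
  \beta_1 = \frac{\partial \ln N^{pat}}{\partial \ln B}
  \approx \frac{\%\,\Delta N^{pat}}{\%\,\Delta B}, \\
\text{1\% increase in } B:\quad
  \frac{N^{pat}_{new}}{N^{pat}_{old}} &= 1.01^{\,0.845} \approx 1.00844 .
\end{aligned}
```

The exact change is therefore about 0.844%, which the usual approximation rounds to 0.845%.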

To analyze this relationship further, each company's average number of inventions patented is regressed on its average R&D budget (averages taken over the seven-year period). The estimated equation:

$$\hat{\bar{P}}_i = 1.370 + 0.871\, \overline{RND}_i$$
t: (5.53) (10.28); $R^2 = 0.71$
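Averaging over time and then running OLS on the averages can be sketched as follows. The data are simulated and noise-free, with 45 hypothetical companies and 7 periods to mirror the dimensions of the example; the function name `between_estimator` and the coefficient values are assumptions, not the data behind the reported estimates.

```python
import numpy as np

def between_estimator(y, x, i_index):
    """OLS on individual averages: one observation per individual.

    Returns (coef, n_used), where coef = [intercept, slope].
    """
    ids, pos = np.unique(i_index, return_inverse=True)
    counts = np.bincount(pos)
    y_bar = np.bincount(pos, weights=y) / counts   # average of y per individual
    x_bar = np.bincount(pos, weights=x) / counts   # average of x per individual
    X = np.column_stack([np.ones(len(ids)), x_bar])
    coef, *_ = np.linalg.lstsq(X, y_bar, rcond=None)
    return coef, len(ids)

# Hypothetical panel with the same shape as the example: 45 x 7.
N, T = 45, 7
i_index = np.repeat(np.arange(N), T)
rng = np.random.default_rng(4)
x = rng.normal(size=N * T)
y = 1.0 + 0.8 * x  # assumed true intercept 1.0 and slope 0.8

coef_between, n_used = between_estimator(y, x, i_index)
print(coef_between, n_used)
```

Note that the regression uses only N = 45 observations instead of N·T = 315, because all within-company variation over time has been averaged away.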

Observations:
1. There is a positive relationship between R&D spending and the number of inventions patented.
2. For every 1% increase in R&D spending, the number of inventions patented increases by 0.87%.
3. However, this model cannot distinguish variation in the number of inventions patented across companies that is not caused by R&D spending.
4. We need a model that can be used to analyze differences in the number of inventions across individual companies that are not caused by R&D spending.

Estimation based on the FEM approach:
$$P_{it} = \beta_0 + \beta_1 RND_{i,t-5} + W_{it}\gamma_i + \varepsilon_{it}$$

Estimated equation:
$$\hat{P}_{it} = 0.195\, RND_{i,t-5}$$
t: (2.35); $R^2 = 0.937$

Since there are 45 different intercepts, they are not written out explicitly. However, based on both the F and t tests, all parameters are jointly and individually significant.

For comparison, REM is also estimated. The estimated model is:
$$\hat{P}_{it} = 2.299 + 0.519\, RND_{i,t-5}$$
t: (12.13) (8.78); $R^2 = 0.91$

The four approaches we have tried each give a different result and thus a different interpretation. Since the data are panel data, we should use either FEM or REM. The choice can be guided by the objective of the analysis. If we really want to know the impact of factors other than R&D spending on the number of inventions patented across companies, FEM could be used.

However, the Hausman specification test can be used to investigate whether the residuals are uncorrelated with the regressor, as REM requires.

Remark: For this example, the Hausman test shows that the requirement that the residuals be uncorrelated with the regressor is not fulfilled. So, for this example, FEM is more appropriate, and the analysis and model interpretation should be based on FEM.
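For a single slope coefficient, the Hausman statistic reduces to the squared difference between the FEM and REM estimates divided by the difference of their variances. The sketch below is a back-of-the-envelope illustration, not the calculation behind the remark: it assumes that the values reported in parentheses are t-ratios (coefficient divided by standard error) and recovers the standard errors from them. The function name `hausman_single` is hypothetical.

```python
def hausman_single(b_fe, se_fe, b_re, se_re):
    """Hausman statistic for one coefficient (chi-square, 1 degree of freedom)."""
    return (b_fe - b_re) ** 2 / (se_fe ** 2 - se_re ** 2)

# Slope estimates and t-ratios as reported for FEM and REM in Example 2.
b_fe, t_fe = 0.195, 2.35
b_re, t_re = 0.519, 8.78

# Assumption: each reported t-ratio equals coefficient / standard error.
se_fe = b_fe / t_fe
se_re = b_re / t_re

h_stat = hausman_single(b_fe, se_fe, b_re, se_re)

CHI2_1DF_5PCT = 3.841  # 5% critical value of the chi-square distribution, 1 df
reject_rem = h_stat > CHI2_1DF_5PCT
print(h_stat, reject_rem)
```

Under these assumptions the statistic is far above the critical value, which points in the same direction as the remark above: the REM requirement is rejected and FEM is preferred.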

Example 3

To analyze the cost function of the automobile industry, the costs and outputs of 4 companies (say Toyota, Honda, Suzuki, and Kia) were observed over a ten-year period. The cost function, using the FEM approach, is represented by:

$$C_{it} = \alpha + \gamma_2 W_{2t} + \gamma_3 W_{3t} + \gamma_4 W_{4t} + \beta Q_{it} + \varepsilon_{it}$$

$C_{it}$: total cost of company i at time t
$Q_{it}$: total output of company i at time t
$W_{it}$ = 1 for company i; i = 2 (Honda), 3 (Suzuki), 4 (Kia)
$W_{it}$ = 0 otherwise (Toyota)

The estimated cost function (all parameters significant at α = 5%):

$$\hat{C}_{it} = 16.171 - 2.385\, W_{2t} - 2.315\, W_{3t} + 10.110\, W_{4t} + 1.119\, Q_{it}$$

From the estimated cost function, answer the following questions:
(i) Which company is the most cost-efficient? Why?
(ii) For Suzuki, for example, what is the cost of producing 1000 units? Explain.
(iii) For Toyota, for example, what is the cost of producing 1000 units? Explain.
(iv) Which company is the least cost-efficient? Why?

Which one is a proper estimator?

I. Estimation using OLS
$$\hat{P}_{it} = 1.438 + 0.845\, RND_{i,t-5}$$
t: (14.01) (24.17); $R^2 = 0.65$

II. Averaging over t, and using OLS for estimation
$$\hat{\bar{P}}_i = 1.370 + 0.871\, \overline{RND}_i$$
t: (5.54) (10.28); $R^2 = 0.71$

III. Estimation with FEM
$$P_{it} = \beta_0 + \beta_1 RND_{i,t-5} + W_{it}\gamma_i + \varepsilon_{it}$$
Estimate:
$$\hat{P}_{it} = 0.195\, RND_{i,t-5}$$
t: (2.35); $R^2 = 0.937$

IV. Estimation with REM
$$\hat{P}_{it} = 2.299 + 0.519\, RND_{i,t-5}$$
t: (12.13) (8.78); $R^2 = 0.91$