Regression Model with Panel Data

June 2, 2020 Ajeng Prastiwi

26 minute read

Introduction

Panel Regression

Panel data are also called longitudinal data or cross-sectional time-series data. A panel data set has multiple entities, each of which has repeated measurements at different time periods. Panel data may have individual (group) effect, time effect, or both, which are analyzed by fixed effect and/or random effect models.¹

Panel data examples:

Annual unemployment rates of each state over several years
Quarterly sales of individual stores over several quarters
Salary for the same worker, working at several different jobs.

Why should we use panel data?

Controlling for individual heterogeneity
Panel data give more informative data, more variability, less collinearity among the variables, more degress of freedom and more efficiency
Omitted variable bias.

Learning Objectives

The goal of this article is to help you:

Understand the concept of panel regression
Understand structure of panel data sets
Implement panel regression in business case.

Library and setup

library(dplyr)
library(lubridate)
library(plm)
library(AER)
library(gplots)
library(lmtest)
library(rsample)
library(MLmetrics)

Advantages of panel data

Panel data, by blending the inter-individual differences and intra-individual dynamics, have several advantages over cross-sectional or time-series data:

The obvious benefit is in terms of obtaining a large sample, giving more degrees of freedom, more variability, more information and less multicollinearity among the variables hence improving the efficiency of econometrics estimates ².
Interpretable models, we can interpret the regression coefficients in the framework of a cross-section and time-series effect.

Structure of Panel data sets

Panel data have three types of data, there are cross section, pooled cross section, and panel. To help you visualize these types of data we’ll consider some sample data sets below.

We use the Grunfeld datasets from AER package. This datasets tell about Grunfeld’s Investment Data on 11 large US manufacturing firms over 20 years, for the years 1935-1954.

data("Grunfeld")
head(Grunfeld)

#>   invest  value capital           firm year
#> 1  317.6 3078.5     2.8 General Motors 1935
#> 2  391.8 4661.7    52.6 General Motors 1936
#> 3  410.6 5387.1   156.9 General Motors 1937
#> 4  257.7 2792.2   209.2 General Motors 1938
#> 5  330.8 4313.2   203.4 General Motors 1939
#> 6  461.2 4643.9   207.2 General Motors 1940

This datasets have 20 annual observations on 3 variables for 11 firms, here some description of each feature:

invest : Gross investment, defined as additions to plant and equipment plus maintenance and repairs in millions of dollars deflated by the implicit price deflator of producers’ durable equipment (base 1947).
value: Market value of the firm, defined as the price of common shares at December 31 (or, for WH, IBM and CH, the average price of December 31 and January 31 of the following year) times the number of common shares outstanding plus price of preferred shares at December 31 (or average price of December 31 and January 31 of the following year) times number of preferred shares plus total book value of debt at December 31 in millions of dollars deflated by the implicit GNP price deflator (base 1947).
capital: Stock of plant and equipment, defined as the accumulated sum of net additions to plant and equipment deflated by the implicit price deflator for producers’ durable equipment (base 1947) minus depreciation allowance deflated by depreciation expense deflator (10 years moving average of wholesale price index of metals and metal products, base 1947).
firm: factor with 11 levels: General Motors, US Steel, General Electric, Chrysler, Atlantic Refining, IBM, Union Oil, Westinghouse, Goodyear, Diamond Match, American Steel.
year: Year.

Cross Sectional Data

Cross sectional data is a type of data collected by observing many subjects at the one point or period of time.

csd <- Grunfeld %>% 
  select(-firm) %>% 
  filter(year == 1954)
head(csd)

#>    invest  value capital year
#> 1 1486.70 5593.6  2226.3 1954
#> 2  459.30 2115.5   669.7 1954
#> 3  189.60 2759.9   888.9 1954
#> 4  172.49  703.2   414.9 1954
#> 5   81.43  365.7   804.9 1954
#> 6  135.72  927.3   238.7 1954

Pooled Cross Sectional Data

Pooled cross sectional data is multiple snapshots of multiple bunches of (randomly selected) individuals (or states or firms or whatever) at many points in time ³.

pcs <- Grunfeld%>% 
  select(-firm)
head(pcs)

#>   invest  value capital year
#> 1  317.6 3078.5     2.8 1935
#> 2  391.8 4661.7    52.6 1936
#> 3  410.6 5387.1   156.9 1937
#> 4  257.7 2792.2   209.2 1938
#> 5  330.8 4313.2   203.4 1939
#> 6  461.2 4643.9   207.2 1940

pcs is an example of a pooled cross-sectional data set because we have observations with different year. We can use the same notation here as in cross section, indexing each person, firm, city, etc. by $i$. Suppose we have 11 cross sectional datasets from 19 different years, pooling the data means to treat them as one larger sample and control for the fact that some observations are from a different year.

Panel Data

Panel data have a special structure, each row of the data corresponds to a specific individual and time period.

head(Grunfeld)

#>   invest  value capital           firm year
#> 1  317.6 3078.5     2.8 General Motors 1935
#> 2  391.8 4661.7    52.6 General Motors 1936
#> 3  410.6 5387.1   156.9 General Motors 1937
#> 4  257.7 2792.2   209.2 General Motors 1938
#> 5  330.8 4313.2   203.4 General Motors 1939
#> 6  461.2 4643.9   207.2 General Motors 1940

The data set has 11 firms with 220 observation. This particular panel data set is sometimes referenced as a balanced panel data set because we observe every single firm has 20 observations.

The equation for model becomes:

$y_{it} = \beta_0+\beta _1x_i+u_{it}$

where :

$y_{it}$ is the dependent variable
$i$ denoting households, individuals, firms, countries, etc.
$t$ denoting time.

The i subscript, therefore, denotes the cross-section dimension whereas t denotes the time series dimension. How do we account for the cross section and time heterogeneity in this model? this is done by using a two-way error component assumption for the disturbances, $u_{it}$ with:

$u_{it} = \mu _i+\lambda _t+\upsilon _{it}$

where:

$\mu _i$ represents the unobservable individual (cross section) heterogeneity
$\lambda _t$ denotes the unobservable time heterogeneity
$\upsilon_{it}$ is the remaining random error term

The first two components($\mu _i$ and $\lambda _t$) are also called within component and the last $\upsilon _{it}$, panel or between component

Heterogeneity

We need to make sure that our data is a balanced panel or an unbalanced panel. If we have an unbalanced panel, try to classify individuals or time periods and get some manageable.

plotmeans(formula = invest~year, main = "Heterogeineity Across Year", data = Grunfeld)

From the plot above, we have a balanced panel dataset with 11 observations every year.

Estimation model of panel regression

Pooled Regression

The (pooled) OLS is a pooled linear regression without fixed and random effects. It assumes a constant intercept and slopes regardless of group and time period. We will use the plm command with the option model = "pooling" to obtain the pooled estimates:

pooling <- plm(invest~value + capital, data = Grunfeld, model = "pooling", index = c("firm","year"))
summary(pooling)

#> Pooling Model
#> 
#> Call:
#> plm(formula = invest ~ value + capital, data = Grunfeld, model = "pooling", 
#>     index = c("firm", "year"))
#> 
#> Balanced Panel: n = 11, T = 20, N = 220
#> 
#> Residuals:
#>     Min.  1st Qu.   Median  3rd Qu.     Max. 
#> -290.331  -25.762   11.059   29.741  377.936 
#> 
#> Coefficients:
#>                Estimate  Std. Error t-value  Pr(>|t|)    
#> (Intercept) -38.4100540   8.4133709 -4.5654  8.35e-06 ***
#> value         0.1145344   0.0055188 20.7534 < 2.2e-16 ***
#> capital       0.2275141   0.0242283  9.3904 < 2.2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Total Sum of Squares:    9712000
#> Residual Sum of Squares: 1768700
#> R-Squared:      0.81789
#> Adj. R-Squared: 0.81621
#> F-statistic: 487.284 on 2 and 217 DF, p-value: < 2.22e-16

Three additional arguments are common to these function:

index: this argument enables the estimation function to identify the structure of the data, the individual and the time period for each observation
effect: the kind of effects to include in the model, individual effects, time effects, and two-ways
model: the kind of model to be estimated, most of the time a model with fixed effects or a model with random effects

summary(pooling)

#> Pooling Model
#> 
#> Call:
#> plm(formula = invest ~ value + capital, data = Grunfeld, model = "pooling", 
#>     index = c("firm", "year"))
#> 
#> Balanced Panel: n = 11, T = 20, N = 220
#> 
#> Residuals:
#>     Min.  1st Qu.   Median  3rd Qu.     Max. 
#> -290.331  -25.762   11.059   29.741  377.936 
#> 
#> Coefficients:
#>                Estimate  Std. Error t-value  Pr(>|t|)    
#> (Intercept) -38.4100540   8.4133709 -4.5654  8.35e-06 ***
#> value         0.1145344   0.0055188 20.7534 < 2.2e-16 ***
#> capital       0.2275141   0.0242283  9.3904 < 2.2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Total Sum of Squares:    9712000
#> Residual Sum of Squares: 1768700
#> R-Squared:      0.81789
#> Adj. R-Squared: 0.81621
#> F-statistic: 487.284 on 2 and 217 DF, p-value: < 2.22e-16

The Fixed Effects Model

Fixed effect model explore the relationship between predictor and outcome variables within and entity (country, person, company, .etc). Each entity has its own individual characteristics that may or may not influence the predictor variables. When using Fixed Effects Model we assume that something within the individual may impact or bias the predictor or outcome variables and we need to control for this. This is the rationale behind the assumption of the correlation between entity’s error term and predictor variables. Fixed Effect remove the effect of those time-invariant characteristics so we can assess the net effect of the predictors on the outcome variable⁴.

The equation for the fixed effects model becomes:

$ y_{it} = iX{it} + i + u{it} $

Where:

$y_{it}$ is the dependent variable, where i = entity and t = time
$\beta_i$ is the coefficient for independent variable
$X_{it}$ represents one independent variable
$\alpha _i$ that represents a unique value for each individual in the unit individual. it would represent the effect of all characteristics of an individual that do not change over time
$u_{it}$ is the error term

with $i = 1,...,n$ and $t=1,...,T$. The $\alpha _i$ are entity-specific intercepts that capture heterogeneities across entities. An equivalent representation of this model is given by

$ y_{it} = _0 + iX{it} + …+kX{it}+ _2D2_i+ _3D3_i +…+_nDn_i$

where the $\gamma _2D2_i+ \gamma _3D3_i + \gamma _nDn_i$ are dummy variables

There are several strategies for estimating a fixed effect model.

Least Square Dummy Variable

The least squares dummy variable model (LSDV) uses dummy variables. These strategies of course, produce the identical parameter estimates of regressor (non dummy independent variables). The between estimation fits a model using individual or time means of dependent and independent without dummies.

lsdv <- lm(invest~value + capital + factor(firm), data = Grunfeld)
summary(lsdv)

#> 
#> Call:
#> lm(formula = invest ~ value + capital + factor(firm), data = Grunfeld)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -184.008  -15.660    0.272   16.414  250.753 
#> 
#> Coefficients:
#>                                 Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                    -70.29907   47.37535  -1.484    0.139    
#> value                            0.11013    0.01130   9.746  < 2e-16 ***
#> capital                          0.31003    0.01654  18.744  < 2e-16 ***
#> factor(firm)US Steel           172.20381   29.70010   5.798 2.48e-08 ***
#> factor(firm)General Electric  -165.27033   30.28530  -5.457 1.38e-07 ***
#> factor(firm)Chrysler            42.48996   41.84991   1.015    0.311    
#> factor(firm)Atlantic Refining  -44.30345   48.12221  -0.921    0.358    
#> factor(firm)IBM                 47.13887   44.61445   1.057    0.292    
#> factor(firm)Union Oil            3.75484   48.19182   0.078    0.938    
#> factor(firm)Westinghouse        12.75258   41.98604   0.304    0.762    
#> factor(firm)Goodyear           -16.91548   46.17941  -0.366    0.715    
#> factor(firm)Diamond Match       63.73104   47.96886   1.329    0.185    
#> factor(firm)American Steel      49.72087   48.28006   1.030    0.304    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 50.3 on 207 degrees of freedom
#> Multiple R-squared:  0.9461, Adjusted R-squared:  0.9429 
#> F-statistic: 302.6 on 12 and 207 DF,  p-value: < 2.2e-16

Within Estimator

The within estimation doesn’t need dummy variables, but it uses deviations from group (or time period) means. That is within estimation uses variation within each individual or entity instead of a large number of dummies.

The within estimator substracts these unit-level means from the response, treatment, and all the controls:

$Y_{it}-\bar{Y_i} = (X_{it}-\bar{X_i})'\beta+ (u_{it}-\bar{u_i})$

Also note that since $\bar{Y_i}$ are unit averages, and the unobserved effect is constant over time, substracting off the mean also subtracts that unobserved effect.

The default behavior of plm is to introduce individual effects. Using the effect argument, one may also introduce:

individual effects( effect = “individual”)
time effects( effect = “time” )
individual and time effects ( effect = “twoways” ). The two-ways effect model is for the moment only available for balanced panels

Here’s the code using plm. Note that in this case, model = "within" means fixed effects for the entity variable.

within <- plm(invest~value + capital, data = Grunfeld, model = "within", index = c("firm","year"), effect = "twoways")
summary(within)

#> Twoways effects Within Model
#> 
#> Call:
#> plm(formula = invest ~ value + capital, data = Grunfeld, effect = "twoways", 
#>     model = "within", index = c("firm", "year"))
#> 
#> Balanced Panel: n = 11, T = 20, N = 220
#> 
#> Residuals:
#>      Min.   1st Qu.    Median   3rd Qu.      Max. 
#> -163.4113  -17.6747   -1.8345   17.9490  217.1070 
#> 
#> Coefficients:
#>         Estimate Std. Error t-value  Pr(>|t|)    
#> value   0.116681   0.012933  9.0219 < 2.2e-16 ***
#> capital 0.351436   0.021049 16.6964 < 2.2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Total Sum of Squares:    1672200
#> Residual Sum of Squares: 459400
#> R-Squared:      0.72527
#> Adj. R-Squared: 0.67997
#> F-statistic: 248.15 on 2 and 188 DF, p-value: < 2.22e-16

The Fixed Effect model assumes an individual-speciﬁc intercept that is non-random, while all other coefficients are fixed (homogeneous across units)⁵. We can get the fixed effects constant for each firm and year using fixef(). The fixef() function returns an object of class fixef.

fixef(within)

#>    General Motors          US Steel  General Electric          Chrysler 
#>         -125.5411           76.7777         -264.8565          -37.3708 
#> Atlantic Refining               IBM         Union Oil      Westinghouse 
#>         -136.2723          -30.2288          -80.5651          -65.4880 
#>          Goodyear     Diamond Match    American Steel 
#>         -101.7344           -7.2787          -23.7715

fixef(within, effect = "time")

#>     1935     1936     1937     1938     1939     1940     1941     1942 
#>  -30.534  -47.494  -66.910  -66.158  -93.634  -70.359  -47.022  -48.534 
#>     1943     1944     1945     1946     1947     1948     1949     1950 
#>  -68.307  -68.855  -80.074  -58.289  -65.412  -68.865  -95.735  -97.922 
#>     1951     1952     1953     1954 
#>  -85.369  -87.023  -89.047 -112.328

A summary method is provided, which prints the effects (in deviation from the overall intercept), their standard error and the test of equality to the overall intercepts.

summary(fixef(within))

#>                    Estimate Std. Error  t-value  Pr(>|t|)    
#> General Motors    -125.5411    54.6650  -2.2966  0.022747 *  
#> US Steel            76.7777    26.7488   2.8703  0.004571 ** 
#> General Electric  -264.8565    26.3789 -10.0405 < 2.2e-16 ***
#> Chrysler           -37.3708    13.9735  -2.6744  0.008146 ** 
#> Atlantic Refining -136.2723    14.7514  -9.2379 < 2.2e-16 ***
#> IBM                -30.2288    12.2184  -2.4740  0.014247 *  
#> Union Oil          -80.5651    12.7329  -6.3273 1.781e-09 ***
#> Westinghouse       -65.4880    13.8303  -4.7351 4.300e-06 ***
#> Goodyear          -101.7344    12.7979  -7.9493 1.690e-13 ***
#> Diamond Match       -7.2787    11.0891  -0.6564  0.512381    
#> American Steel     -23.7715    11.1419  -2.1335  0.034178 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Relationship of type dmean and level and overall intercept

overall <- within_intercept(within)
fx_dmean <- fixef(within, type = "dmean") 
fx_level <- fixef(within, type = "level")
  
all.equal(overall + fx_dmean, fx_level, check.attributes = FALSE)

#> [1] TRUE

Calculation fitted values of two ways within model

fixefs <- fixef(within)[index(within, which = "id")]
fitted_by_hand <- fixefs + (within$coefficients["value"] * within$model$value) + (within$coefficients["capital"] * within$model$capital)

The Random Effects Model

In the random effects model, the individual-specific effect is a random variable that is uncorrelated with the explanatory variables. The advantages of random effects specification are: ⁶

The number of paramaters stay constant when sample size increases
It allows the derivation of efficient estimators that make use of both within and between (group) variation.
It allows the estimation of the impact of time invariant variables

The equation for the random effects model becomes:

\[y_{it} = \beta _iX_{it} + \alpha +\varepsilon_{it} + u_{it}\]

Where:

$u_{it}$ is between entity error.
$\varepsilon_{it}$ is within entity error

random <- plm(invest~value + capital, data = Grunfeld, model = "random", index = c("firm","year"), effect = "twoways")
summary(random)

#> Twoways effects Random Effect Model 
#>    (Swamy-Arora's transformation)
#> 
#> Call:
#> plm(formula = invest ~ value + capital, data = Grunfeld, effect = "twoways", 
#>     model = "random", index = c("firm", "year"))
#> 
#> Balanced Panel: n = 11, T = 20, N = 220
#> 
#> Effects:
#>                   var std.dev share
#> idiosyncratic 2443.62   49.43 0.283
#> individual    6206.26   78.78 0.717
#> time             0.00    0.00 0.000
#> theta: 0.8611 (id) 0 (time) 2.776e-17 (total)
#> 
#> Residuals:
#>      Min.   1st Qu.    Median   3rd Qu.      Max. 
#> -177.6721  -18.9754    4.1721   14.9430  253.1971 
#> 
#> Coefficients:
#>                Estimate  Std. Error z-value Pr(>|z|)    
#> (Intercept) -53.9812884  26.0814102 -2.0697  0.03848 *  
#> value         0.1093251   0.0099449 10.9931  < 2e-16 ***
#> capital       0.3081066   0.0163788 18.8113  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Total Sum of Squares:    2388700
#> Residual Sum of Squares: 549690
#> R-Squared:      0.76988
#> Adj. R-Squared: 0.76776
#> Chisq: 725.987 on 2 DF, p-value: < 2.22e-16

For a random model, the summary output gives information about the variance of the components of the error.

ranef(random)

#>    General Motors          US Steel  General Electric          Chrysler 
#>        -11.360155        154.988363       -175.795335         26.442593 
#> Atlantic Refining               IBM         Union Oil      Westinghouse 
#>        -58.348487         30.754167        -11.607130         -2.805524 
#>          Goodyear     Diamond Match    American Steel 
#>        -31.765611         46.565016         32.932104

Fixed Effect vs Random Effect Models

Panel data models estimate fixed and/or random effects models using dummy variables. The core difference between fixed and random effect models lies in the role of dummies. If dummies are consideres as a part of the intercept, it is a fixed effect model. In a random effect model, the dummies act as an error term.

The fixed effect model examines group difference in intercepts, assuming the same slopes and constant variance across groups. Fixed effect models use Leas Square Dummy Variable (LSDV), within effect, and between effect estimation methods. Thus OLS regression with dummies, in fact, are fixes effect models.

Selection Method of panel regression

We can check the several tests are the Chow Test, Hausman Test, and Lagrange Test to decide on the most appropriate model:

1. Chow Test for Poolability

Chow test is a test to determine the model of whether Pooled Effect or Fixed Effect (FE) is most appropriately used in estimating panel data. What is poolability? Poolability asks if slopes are the same across group or overtime.

H0: pooled effect model

H1: fixed effect model

pFtest(within,pooling)

#> 
#>  F test for twoways effects
#> 
#> data:  invest ~ value + capital
#> F = 18.476, df1 = 29, df2 = 188, p-value < 2.2e-16
#> alternative hypothesis: significant effects

From the result above, we can see the p-value is below 0.05 so a fixed effect model is better for this data

2. Hausman Test

Hausman test is a statistical test to select whether the most appropriate Fixed Effect or Random Effect model is used. The additional arguments is model, the kind of model to be estimated, most of the time a model with fixed effects or a model with random effects.

H0: Random Effect Model

H1: Fixed Effect Model

phtest(within, random)

#> 
#>  Hausman Test
#> 
#> data:  invest ~ value + capital
#> chisq = 13.253, df = 2, p-value = 0.001325
#> alternative hypothesis: one model is inconsistent

From the result above, the p-value is below 0.05 so a fixed effect model is better for this data

3. Lagrange Multiplier

Lagrange Multiplier to determine whether Random Effect (RE) is better than Comman Effect method.

H0: Pooled Effect Model

H1: Random Effect Model

plmtest(within,effect = "twoways",type = "bp")

#> 
#>  Lagrange Multiplier Test - two-ways effects (Breusch-Pagan) for
#>  balanced panels
#> 
#> data:  invest ~ value + capital
#> chisq = 881.07, df = 2, p-value < 2.2e-16
#> alternative hypothesis: significant effects

From the result above, the p-value is below 0.05 so a random effect model is better for this data

The test will help you to decide, but your explanation/motivation according to the data structure and research question to use an estimator or another it’s a better choice.

Diagnostics Test

Test of Serial Correlation

Serial correlation is the relationship between a variable and a lagged version of itself over various time intervals.

Serial correlation hypothesis test:

H0: There is not serial correlation

H1: There is serial correlation

pbgtest(random)

#> 
#>  Breusch-Godfrey/Wooldridge test for serial correlation in panel models
#> 
#> data:  invest ~ value + capital
#> chisq = 77.049, df = 20, p-value = 1.237e-08
#> alternative hypothesis: serial correlation in idiosyncratic errors

From the result above, the p-value is below 0.05 so there is serial correlation.

Testing for Heteroskedasticity

Heteroscedasticity is a condition where the variability of a variable is unequal across its range of value.

Heteroscedasticity hypothesis test:

H0: Homoskedasticity

H1: Heteroskedasticity

bptest(random)

#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  random
#> BP = 65.228, df = 2, p-value = 6.854e-15

From the result above, the p-value is below 0.05 so there is heteroscedasticity

Use Case

In this section, we will use case of panel regression in public capital productivity.

productivity <- read.csv("data_input/productivity.csv")
str(productivity)

#> 'data.frame':    816 obs. of  10 variables:
#>  $ STATE: chr  "ALABAMA" "ALABAMA" "ALABAMA" "ALABAMA" ...
#>  $ YR   : int  1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 ...
#>  $ P_CAP: num  15033 15502 15972 16406 16763 ...
#>  $ HWY  : num  7326 7526 7765 7908 8026 ...
#>  $ WATER: num  1656 1721 1765 1742 1735 ...
#>  $ UTIL : num  6051 6255 6442 6756 7002 ...
#>  $ PC   : num  35794 37300 38670 40084 42057 ...
#>  $ GSP  : int  28418 29375 31303 33430 33749 33604 35764 37463 39964 40979 ...
#>  $ EMP  : num  1010 1022 1072 1136 1170 ...
#>  $ UNEMP: num  4.7 5.2 4.7 3.9 5.5 7.7 6.8 7.4 6.3 7.1 ...

The data has 816 observations and 10 variables. The target variable is Gross State Product (GSP). Variable STATE represents the entities or panels and YR represent the time variable. Here some description of each feature:

STATE : state name
YR : year
P_CAP : public capital
HWY : highway capital
WATER : water utility capital
UTIL : utility capital
PC : private capital
GSP : gross state product
EMP : employment rate
UNEMP : unemployment rate

Cross Validation

We split the data into data_train and data_test. The first 13 years will be the training data and the last 4 years will be data test.

set.seed(100)
data_train <- productivity %>% 
              filter( !YR %in% c(1983,1984,1985,1986))
data_test <- productivity %>% 
              filter( YR %in% c(1983,1984,1985,1986))

Exploratory data

Before we go further, we need to make sure that our data is balanced panel from 1970 to 1982.

plotmeans(formula = GSP ~ YR, main = "Heterogeineity Across Year",data = data_train)

From the plot above, we have a balanced panel with 48 observations every year.

Modelling

In this case, we will try to build 3 model from pooling, fixed, and random model.

mod_pooling <- plm(GSP ~ P_CAP + HWY + WATER +
                     UTIL + PC + EMP + UNEMP,
                   data = data_train,
                   model = "pooling")
summary(mod_pooling)

#> Pooling Model
#> 
#> Call:
#> plm(formula = GSP ~ P_CAP + HWY + WATER + UTIL + PC + EMP + UNEMP, 
#>     data = data_train, model = "pooling")
#> 
#> Balanced Panel: n = 48, T = 13, N = 624
#> 
#> Residuals:
#>     Min.  1st Qu.   Median  3rd Qu.     Max. 
#> -23508.4  -3502.1   1006.8   3028.9  33069.1 
#> 
#> Coefficients:
#>                Estimate  Std. Error t-value  Pr(>|t|)    
#> (Intercept) -7.1687e+02  8.0383e+02 -0.8918    0.3728    
#> P_CAP       -7.3614e+04  3.6801e+04 -2.0003    0.0459 *  
#> HWY          7.3615e+04  3.6801e+04  2.0003    0.0459 *  
#> WATER        7.3616e+04  3.6801e+04  2.0004    0.0459 *  
#> UTIL         7.3615e+04  3.6801e+04  2.0004    0.0459 *  
#> PC           3.0618e-01  1.1688e-02 26.1954 < 2.2e-16 ***
#> EMP          1.8854e+01  7.0890e-01 26.5968 < 2.2e-16 ***
#> UNEMP       -6.1782e+02  1.1387e+02 -5.4255 8.314e-08 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Total Sum of Squares:    2.7027e+12
#> Residual Sum of Squares: 1.9502e+10
#> R-Squared:      0.99278
#> Adj. R-Squared: 0.9927
#> F-statistic: 12121 on 7 and 616 DF, p-value: < 2.22e-16

From the pooling model, all of independent variable significants to dependent variable and generate a value of 0.99 for adjusted R squared. The result of pooling model has intercept and slopes regardless of group and time period. Next, we will build within model:

mod_within <- plm(GSP ~ P_CAP + HWY + WATER +
                     UTIL + PC + EMP + UNEMP,
                   data = data_train,
                   model = "within",
                   index = c("STATE","YR"),
                   effect = "twoways")
summary(mod_within)

#> Twoways effects Within Model
#> 
#> Call:
#> plm(formula = GSP ~ P_CAP + HWY + WATER + UTIL + PC + EMP + UNEMP, 
#>     data = data_train, effect = "twoways", model = "within", 
#>     index = c("STATE", "YR"))
#> 
#> Balanced Panel: n = 48, T = 13, N = 624
#> 
#> Residuals:
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#> -7842.0  -718.9   -54.4     0.0   648.7  6593.5 
#> 
#> Coefficients:
#>          Estimate  Std. Error t-value Pr(>|t|)    
#> P_CAP  9.4220e+03  1.1642e+04  0.8093   0.4187    
#> HWY   -9.4222e+03  1.1642e+04 -0.8093   0.4187    
#> WATER -9.4224e+03  1.1642e+04 -0.8093   0.4187    
#> UTIL  -9.4216e+03  1.1642e+04 -0.8093   0.4187    
#> PC    -2.7614e-02  2.0574e-02 -1.3422   0.1801    
#> EMP    3.7677e+01  7.6494e-01 49.2555   <2e-16 ***
#> UNEMP  8.6094e+01  7.1320e+01  1.2072   0.2279    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Total Sum of Squares:    3.512e+10
#> Residual Sum of Squares: 1563600000
#> R-Squared:      0.95548
#> Adj. R-Squared: 0.9502
#> F-statistic: 1707.7 on 7 and 557 DF, p-value: < 2.22e-16

To check the intercept for each cross-section and time period, we can used function fixef().

fixef(mod_within)

#>        ALABAMA        ARIZONA       ARKANSAS     CALIFORNIA       COLORADO 
#>      -10424.53       -3566.05       -5122.71       -2201.48       -3368.31 
#>    CONNECTICUT       DELAWARE        FLORIDA        GEORGIA          IDAHO 
#>       -6830.42       -1991.55      -17674.65      -16897.08       -1587.21 
#>       ILLINOIS        INDIANA           IOWA         KANSAS       KENTUCKY 
#>      -11828.54      -12051.68       -2441.26        -516.46       -2156.66 
#>      LOUISIANA          MAINE       MARYLAND  MASSACHUSETTS       MICHIGAN 
#>       20243.90       -3992.07       -9682.28      -23539.79       -5966.18 
#>      MINNESOTA    MISSISSIPPI       MISSOURI        MONTANA       NEBRASKA 
#>       -8757.58       -5991.93      -11232.59         425.73       -3624.40 
#>         NEVADA  NEW_HAMPSHIRE     NEW_JERSEY     NEW_MEXICO       NEW_YORK 
#>       -1231.83       -3607.94      -10573.66        1585.53      -32166.92 
#> NORTH_CAROLINA   NORTH_DAKOTA           OHIO       OKLAHOMA         OREGON 
#>      -19458.55         474.27      -20503.08        3315.88       -4532.44 
#>   PENNSYLVANIA   RHODE_ISLAND SOUTH_CAROLINA   SOUTH_DAKOTA       TENNESSE 
#>      -32702.66       -4520.25      -12793.72        -894.22      -15095.64 
#>          TEXAS           UTAH        VERMONT       VIRGINIA     WASHINGTON 
#>       30577.25       -3716.22       -1987.88       -9702.83       -3925.40 
#>  WEST_VIRGINIA      WISCONSIN        WYOMING 
#>       -1573.68      -12151.91        4245.09

fixef(mod_within,effect = "time")

#>    1970    1971    1972    1973    1974    1975    1976    1977    1978    1979 
#> -6561.2 -5764.7 -5247.7 -4967.8 -6592.4 -6677.1 -5971.0 -5654.3 -5848.6 -6611.8 
#>    1980    1981    1982 
#> -7594.0 -7030.4 -7193.8

Next, we will build the random model:

mod_random <- plm(GSP ~ P_CAP + HWY + WATER +
                     UTIL + PC + EMP + UNEMP,
                   data = data_train,
                   model = "random",
                   index = c("STATE","YR"),
                  effect = "twoways")
summary(mod_random)

#> Twoways effects Random Effect Model 
#>    (Swamy-Arora's transformation)
#> 
#> Call:
#> plm(formula = GSP ~ P_CAP + HWY + WATER + UTIL + PC + EMP + UNEMP, 
#>     data = data_train, effect = "twoways", model = "random", 
#>     index = c("STATE", "YR"))
#> 
#> Balanced Panel: n = 48, T = 13, N = 624
#> 
#> Effects:
#>                     var   std.dev share
#> idiosyncratic 2.807e+06 1.676e+03 0.098
#> individual    2.571e+07 5.070e+03 0.900
#> time          5.286e+04 2.299e+02 0.002
#> theta: 0.9087 (id) 0.2753 (time) 0.2749 (total)
#> 
#> Residuals:
#>       Min.    1st Qu.     Median    3rd Qu.       Max. 
#> -8850.7380  -588.8907     4.4832   549.8404  9901.7155 
#> 
#> Coefficients:
#>                Estimate  Std. Error z-value  Pr(>|z|)    
#> (Intercept) -3.2450e+03  1.1435e+03 -2.8377 0.0045441 ** 
#> P_CAP        1.2397e+04  1.2799e+04  0.9686 0.3327469    
#> HWY         -1.2397e+04  1.2799e+04 -0.9686 0.3327483    
#> WATER       -1.2398e+04  1.2799e+04 -0.9686 0.3327336    
#> UTIL        -1.2397e+04  1.2799e+04 -0.9686 0.3327561    
#> PC           6.0680e-02  1.7461e-02  3.4752 0.0005104 ***
#> EMP          3.3964e+01  6.8185e-01 49.8107 < 2.2e-16 ***
#> UNEMP       -1.5642e+02  5.9315e+01 -2.6372 0.0083600 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Total Sum of Squares:    6.6693e+10
#> Residual Sum of Squares: 2126900000
#> R-Squared:      0.96811
#> Adj. R-Squared: 0.96775
#> Chisq: 18701.2 on 7 DF, p-value: < 2.22e-16

Variable private capital and employee rate have positively related to gross state product and other independent variable have negatively related to our target variable.

ranef(mod_random)

#>        ALABAMA        ARIZONA       ARKANSAS     CALIFORNIA       COLORADO 
#>    -5230.10493     1819.95913      -90.08252    18339.30247     2153.96993 
#>    CONNECTICUT       DELAWARE        FLORIDA        GEORGIA          IDAHO 
#>      382.82692     3198.92806    -8683.31375    -9655.38005     3013.77662 
#>       ILLINOIS        INDIANA           IOWA         KANSAS       KENTUCKY 
#>    -2514.21002    -6107.49603     1115.73924     2934.53554     2356.77859 
#>      LOUISIANA          MAINE       MARYLAND  MASSACHUSETTS       MICHIGAN 
#>    20080.55495     1305.33545    -2291.20789   -12570.66767     2684.74534 
#>      MINNESOTA    MISSISSIPPI       MISSOURI        MONTANA       NEBRASKA 
#>    -3331.05067    -1078.35286    -4937.46721     4143.34031      795.33684 
#>         NEVADA  NEW_HAMPSHIRE     NEW_JERSEY     NEW_MEXICO       NEW_YORK 
#>     3482.59290     1225.91024    -1061.28694     5718.19874    -8522.63749 
#> NORTH_CAROLINA   NORTH_DAKOTA           OHIO       OKLAHOMA         OREGON 
#>   -11998.91470     3967.56349   -11794.13171     6770.49466     1127.12389 
#>   PENNSYLVANIA   RHODE_ISLAND SOUTH_CAROLINA   SOUTH_DAKOTA       TENNESSE 
#>   -21524.30136     1283.93650    -6993.38460     2757.16259    -8700.68861 
#>          TEXAS           UTAH        VERMONT       VIRGINIA     WASHINGTON 
#>    28325.56849     1269.45013     2843.35365    -3205.10527     3144.51776 
#>  WEST_VIRGINIA      WISCONSIN        WYOMING 
#>     2297.18196    -5229.31019     6980.91007

After we build 3 models above, we need to decide which model will to choose from the test:

Chow test for poolability

pFtest(mod_within, mod_pooling)

#> 
#>  F test for twoways effects
#> 
#> data:  GSP ~ P_CAP + HWY + WATER + UTIL + PC + EMP + UNEMP
#> F = 108.31, df1 = 59, df2 = 557, p-value < 2.2e-16
#> alternative hypothesis: significant effects

The test has a p-value less than the significance level of 0.05, therefore we reject H0, we can conclude to select fixed effect model.

Hausman Test

phtest(mod_within, mod_random)

#> 
#>  Hausman Test
#> 
#> data:  GSP ~ P_CAP + HWY + WATER + UTIL + PC + EMP + UNEMP
#> chisq = 7.9098, df = 7, p-value = 0.3406
#> alternative hypothesis: one model is inconsistent

The test has a p-value above the significance level of 0.05, therefore we fail to reject H0, we can conclude to select random effect model.

Lagrange Multiplier

plmtest(mod_within, effect = "twoways", type = "bp")

#> 
#>  Lagrange Multiplier Test - two-ways effects (Breusch-Pagan) for
#>  balanced panels
#> 
#> data:  GSP ~ P_CAP + HWY + WATER + UTIL + PC + EMP + UNEMP
#> chisq = 1948.8, df = 2, p-value < 2.2e-16
#> alternative hypothesis: significant effects

The test has a p-value less than the significance level of 0.05, therefore we reject H0, we can conclude to select random effect model.

From the three test above, we will used random effect model. Next, we only check the assumption of random effect model:

Test of serial correlation

pbgtest(mod_random)

#> 
#>  Breusch-Godfrey/Wooldridge test for serial correlation in panel models
#> 
#> data:  GSP ~ P_CAP + HWY + WATER + UTIL + PC + EMP + UNEMP
#> chisq = 291.28, df = 13, p-value < 2.2e-16
#> alternative hypothesis: serial correlation in idiosyncratic errors

The test has a p-value less than the significance level of 0.05, therefore we reject H0, we can conclude there is serial correlation.

Heteroscedasticity

We need to check heteroscedasticity using the Breusch Pagan test.

bptest(mod_random)

#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  mod_random
#> BP = 276.65, df = 7, p-value < 2.2e-16

The test has a p-value less than the significance level of 0.05, therefore we reject the null hypothesis.

We can conclude that the assumption is not already passed. This may be a restrictive assumption in many panel data applications.

Predict data test

pred <- predict(mod_random,data_test)

Let’s check the performance of our model

RMSE(pred, data_test$GSP)

#> [1] 9376.271

MAPE(pred, data_test$GSP)

#> [1] 0.1312989

Conlusion

Panel regression is a method that can build model both the common and individual behaviors of groups. The advantage of a panel data set over a cross-section is that it will allow the researcher great flexibility in modeling of differences in behavior across individuals. The challenge from implementation panel regression is the assumption of no unobserved heterogeneity.

Home

Machine Learning

Data Visualization

Article List