Econometrics Beat: Dave Giles' Blog
Sun, 27 Oct 2019 18:00:34 GMT

*This post was prompted by an email query that I received some time ago from a reader of this blog. I thought that a more "expansive" response might be of interest to other readers............*

In spite of its many limitations, it's standard practice to include the value of the coefficient of determination (R^{2}) - or its "adjusted" counterpart - when reporting the results of a least squares regression. Personally, I think that R^{2} is one of the least important statistics to include in our results, but we all do it. (See this previous post.)

R^{2} has a number of well-known properties. These include:

1. 0 ≤ R^{2} ≤ 1.
2. The value of R^{2} cannot decrease if we add regressors to the model.
3. The value of R^{2} is the same, whether we define this measure as the ratio of the "explained sum of squares" to the "total sum of squares" (R_{E}^{2}); or as one minus the ratio of the "residual sum of squares" to the "total sum of squares" (R_{R}^{2}).
4. There is a correspondence between R^{2} and a significance test on all slope parameters; and there is a correspondence between changes in (the adjusted) R^{2} as regressors are added, and significance tests on the added regressors' coefficients. (See here and here.)
5. R^{2} has an interpretation in terms of the information content of the data.
6. R^{2} is the square of the (Pearson) correlation (R_{C}^{2}) between actual and "fitted" values of the model's dependent variable.


**None of the above properties are guaranteed** to hold once we move away from the linear regression model, with an intercept, estimated by OLS. For example, we can end up with *different* R^{2} values depending on which of the two definitions noted in property 3 above is adopted. Similarly, when estimating Logit and Probit models (for instance), most econometrics packages report several "pseudo-R^{2}" statistics, because there's no single measure that has *all* of the desirable features that we're used to in the linear model/OLS case.

So-called "count" data arise frequently in empirical economics. These are data that take values that are only non-negative integers, namely 0, 1, 2, 3, 4, ........ Models for such data are often based on the Poisson or negative binomial distributions, although other distributions may also be used. Regressors enter the model by equating the mean of the chosen distribution to a positive function of these variables and their coefficients.

For instance, if the y_{i} data (i = 1, 2, ...., n) are being modelled using a Poisson distribution with a mean of μ, then we typically assign μ_{i} = exp[x_{i}'β], using familiar regression notation. The resulting non-linear model is then estimated by MLE (or quasi-MLE).
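To make the setup concrete, here is a minimal sketch in Python (using the statsmodels package) of estimating such a Poisson regression by MLE. The simulated data, sample size, and coefficient values are purely illustrative assumptions on my part, not anything taken from the discussion above.

```python
# A minimal, illustrative sketch: Poisson regression with mu_i = exp(x_i'beta), estimated by MLE.
# The data-generating process and all names here are assumptions made up for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 200
X = sm.add_constant(rng.normal(size=(n, 2)))     # intercept plus two illustrative regressors
beta_true = np.array([0.5, 0.3, -0.2])
y = rng.poisson(np.exp(X @ beta_true))           # simulated count data

poisson_fit = sm.Poisson(y, X).fit(disp=0)       # MLE of beta
mu_hat = poisson_fit.predict(X)                  # fitted means, mu_i* = exp(x_i'beta*)
print(poisson_fit.params)
```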

What's a sensible way of reporting an R^{2} measure for an estimated Poisson regression?

It turns out that there's one R^{2} measure that really stands out as the obvious choice.

What is it?

Before answering this question, let's look at how R_{R}^{2}, R_{E}^{2}, and R_{C}^{2} behave when applied in the context of Poisson, or negative binomial, regression. Some key facts include:

- The three measures will generally differ in value from one another.
- We still have 0 ≤ R_{C}^{2} ≤ 1. However, although R_{R}^{2} ≤ 1, it can be negative (even if an intercept is included in the model); and although R_{E}^{2} ≥ 0, it can be greater than one (even with an intercept).
- All three measures can *decrease* as regressors are added to the model.

When we compare these results with the six properties noted above for the OLS case, they suggest that these R^{2} measures are probably best avoided with count data models. Interestingly, it's R_{R}^{2} that's reported as a matter of course by the EViews package. Stata, on the other hand, reports McFadden's "pseudo-R^{2}" for these models, but its properties are no better.
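As a rough illustration (my own, not from the original post), the three measures can be written out as small Python helpers. Applied to actual counts and the fitted means from a count data model (for example, y and mu_hat from the sketch above), they will generally return three different numbers, unlike in the linear/OLS case.

```python
# Illustrative helpers (assumed names): the three R^2 variants discussed above,
# computed from actual counts y and fitted means mu.
import numpy as np

def r2_residual(y, mu):
    """R_R^2 = 1 - RSS/TSS, based on the 'raw' residuals (y - mu)."""
    y, mu = np.asarray(y, float), np.asarray(mu, float)
    return 1.0 - np.sum((y - mu) ** 2) / np.sum((y - y.mean()) ** 2)

def r2_explained(y, mu):
    """R_E^2 = ESS/TSS, the 'explained sum of squares' version."""
    y, mu = np.asarray(y, float), np.asarray(mu, float)
    return np.sum((mu - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)

def r2_correlation(y, mu):
    """R_C^2: squared Pearson correlation between actual and fitted values."""
    return float(np.corrcoef(y, mu)[0, 1] ** 2)

# For a non-linear model such as the Poisson regression these three values will
# generally differ; in the linear/OLS case (with an intercept) they coincide.
```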

Cameron and Windmeijer (1996) effectively answer the question that I posed above.

They consider various R^{2}-type measures for count data models. These measures differ primarily in the type of residuals (from the estimated model) that are used in their construction. As in the case of a linear regression, *the usual, or "raw", residuals* are the differences between the actual y_{i} values and their "predicted" mean values. That is, they're of the form (y_{i} - μ_{i}*), where μ_{i}* = exp[x_{i}'β*], and β* is the MLE of the β vector. These residuals give us R_{R}^{2}, noted above.

In regression analysis in general, there are actually lots of different forms of residuals that can be constructed, and these can be useful in various situations - especially with generalized linear models (of which the Poisson count model is an example). Some examples include the Pearson (standardized) residuals and the so-called "deviance" residuals. (For more on the notion of "deviance" and goodness-of-fit, see this post.)
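For what it's worth, a GLM fit in Python's statsmodels exposes these residual types directly; the attribute names below are those documented for its GLM results object, and the simulated y and X simply mirror the illustrative sketch given earlier.

```python
# Raw, Pearson, and deviance residuals from a Poisson GLM fit (statsmodels).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
X = sm.add_constant(rng.normal(size=(200, 2)))            # illustrative regressors
y = rng.poisson(np.exp(X @ np.array([0.5, 0.3, -0.2])))   # illustrative counts

glm_fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
raw_resid      = glm_fit.resid_response    # "raw" residuals, y_i - mu_i*
pearson_resid  = glm_fit.resid_pearson     # (y_i - mu_i*) / sqrt(mu_i*) for the Poisson
deviance_resid = glm_fit.resid_deviance    # the deviance residuals discussed below
```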

Cameron and Windmeijer (1996) consider the properties of R^{2} measures for Poisson and negative binomial models based on both of these other types of residuals, as well as on the "raw" residuals. (Cameron and Windmeijer (1997) extend these results to a variety of other non-linear models.)

They make a convincing case for constructing an R^{2} measure using the deviance residuals, when working with a Poisson regression model or the negative binomial (NegBin2) model.

(*As an aside, when the model is linear and we use OLS, the deviance residuals are just the usual residuals*.)

For the Poisson model, the i^{th.} deviance residual is defined as

d_{i} = sign(y_{i} - μ_{i}*)[2{y_{i}log(y_{i} / μ_{i}*) - (y_{i} - μ_{i}*)}]^{½ }; i = 1, 2, ...., n

and the deviance R^{2} for that model is defined as:

R_{D,P}^{2} = 1 - Σ{y_{i}log(y_{i} / μ_{i}*) - (y_{i} - μ_{i}*)} / Σ{y_{i}log(y_{i} / ybar)},

where here and below all summations are for i = 1, 2, ...., n.

If the model includes an intercept, then this formula simplifies to:

R_{D,P}^{2} = 1 - Σ{y_{i}log(y_{i} / μ_{i}*)} / Σ{y_{i}log(y_{i} / ybar)}.

(Note: if y_{i} = 0, then y_{i}log(y_{i}) = 0. In this case, d_{i} = - [2μ_{i}*]^{½}.)
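A direct translation of these formulas into Python might look as follows (the helper names are my own). The code uses the general form of R_{D,P}^{2} given above and handles the y_{i} = 0 convention just noted. Applied to y and mu_hat from the earlier sketch, poisson_deviance_r2(y, mu_hat) returns the deviance R^{2} directly.

```python
# Illustrative translation (assumed helper names) of the Poisson deviance residuals
# and the deviance R^2 defined above, using the convention y*log(y) = 0 when y = 0.
import numpy as np

def _ylog(y, m):
    """Compute y * log(y / m) elementwise, returning 0 where y = 0."""
    ratio = np.where(y > 0, y / m, 1.0)      # avoids log(0); the term is 0 when y = 0
    return y * np.log(ratio)

def poisson_deviance_resid(y, mu):
    """d_i = sign(y_i - mu_i) * sqrt(2 * {y_i log(y_i/mu_i) - (y_i - mu_i)})."""
    y, mu = np.asarray(y, float), np.asarray(mu, float)
    return np.sign(y - mu) * np.sqrt(2.0 * (_ylog(y, mu) - (y - mu)))

def poisson_deviance_r2(y, mu):
    """R_{D,P}^2 = 1 - sum{y log(y/mu) - (y - mu)} / sum{y log(y/ybar)}."""
    y, mu = np.asarray(y, float), np.asarray(mu, float)
    num = np.sum(_ylog(y, mu) - (y - mu))
    den = np.sum(_ylog(y, y.mean()))
    return 1.0 - num / den
```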

**Importantly, R_{D,P}^{2} satisfies properties 1 to 5 noted earlier**.

In the case of the NegBin2 model, the corresponding R^{2} takes the form:

R_{D,NB}^{2} = 1 - (A / B) ,

where

A = Σ{y_{i}log(y_{i} / μ_{i}*) - (y_{i} + α*^{-1})log[(y_{i} + α*^{-1}) / (μ_{i}* + α*^{-1})]}

and

B = Σ{y_{i}log(y_{i} / ybar) - (y_{i} + α*^{-1})log[(y_{i} + α*^{-1}) / (ybar + α*^{-1})]}.

("ybar" is the sample average of the y_{i} values; and α* is the MLE of the dispersion parameter for the NegBin2 distribution.)

**The R_{D,NB}^{2} goodness-of-fit measure satisfies properties 1, 3 and 4 noted earlier.**

**So, when it comes to reporting an R^{2} for count data models, the usual such measure - based on the "raw" residuals - is generally a very poor choice.**

*Of the other options that are available, the R^{2} measures constructed using the so-called "deviance residuals" stand out as excellent contenders.*

**References**

Cameron, A. C. & F. A. G. Windmeijer, 1996. R-squared measures for count data regression models with applications to health-care utilization. *Journal of Business and Economic Statistics*, 14, 209-220. (Download working paper version.)

Cameron, A. C. & F. A. G. Windmeijer, 1997. An R-squared measure of goodness of fit for some common nonlinear regression models. *Journal of Econometrics*, 77, 329-342.
