Bryan Caplan  

Multicollinearity and Micronumerosity


R.J. Rummel's critique of a Cato study set off a big bloggers' debate about the value of think tanks. The following passage in Rummel's critique got my attention:

This correlation is meaningful for the kind of regression analysis Gartzke did, but he apparently doesn't know it. A problem in regression analysis is multicollinearity, which is to say moderate or high correlations among the independent variables. If two independent variables are highly correlated they are no longer statistically independent, and the first one entered into the regression, in this case economic freedom, steals that part of the correlation it has with democracy from the dependent variable. Thus, economic freedom is highly significant, while democracy is not. If Gartzke had done two bivariate regressions on his MID data, one with economic freedom and other with democracy as the independent variables, he surely would have found democracy highly significant.

This reminds me of one of the best few pages I've ever read in a textbook. The book: Arthur Goldberger's A Course in Econometrics. The subject: Multicollinearity and micronumerosity.

Goldberger's main point: People who use statistics often talk as if multicollinearity (high correlations between independent variables) biases results. But it doesn't. Multicollinearity leads to big standard errors, but if your independent variables are highly correlated, they SHOULD be big! Intuitively, big standard errors mean that the effects of different variables are highly uncertain, and if your independent variables are highly correlated, highly uncertain is what you should be.
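A small simulation can make this concrete. The sketch below runs on synthetic data, not on anything from Goldberger's book or the Cato study: the helper names `ols` and `simulate`, the true coefficients, the sample size, and the correlation values 0.0 and 0.95 are all assumptions chosen for illustration.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients and their conventional standard errors."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)          # unbiased error-variance estimate
    se = np.sqrt(sigma2 * np.diag(XtX_inv))
    return beta, se

def simulate(rho, n=200, reps=500, seed=0):
    """Average OLS estimates and standard errors over many synthetic samples.

    rho is the (assumed) correlation between the two regressors.
    """
    rng = np.random.default_rng(seed)
    true_beta = np.array([1.0, 2.0, -1.0])    # intercept, x1, x2 (hypothetical)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    betas, ses = [], []
    for _ in range(reps):
        x = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        X = np.column_stack([np.ones(n), x])
        y = X @ true_beta + rng.normal(size=n)
        b, s = ols(X, y)
        betas.append(b)
        ses.append(s)
    return np.mean(betas, axis=0), np.mean(ses, axis=0)

mean_beta_low, mean_se_low = simulate(rho=0.0)
mean_beta_high, mean_se_high = simulate(rho=0.95)

print("rho=0.00  mean estimates:", mean_beta_low, " mean SEs:", mean_se_low)
print("rho=0.95  mean estimates:", mean_beta_high, " mean SEs:", mean_se_high)
```

Under these assumptions the averaged estimates stay close to the true coefficients at both correlation levels, so there is no bias. The standard errors on the two correlated regressors are noticeably larger at a correlation of 0.95; in theory the inflation factor is 1/sqrt(1 - 0.95^2), or roughly 3.2.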

Goldberger brilliantly drives his point home by introducing the concept of micronumerosity. What's that? A fancy name for "not having a lot of data." If you don't have a lot of data, then again your standard errors tend to be large. As well they should be! If you have three data points, you should be uncertain of your results.
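The same kind of check works for micronumerosity. The sketch below is again synthetic: the function name `average_slope_se`, the true slope of 2.0, and the sample sizes 5 and 500 are assumptions made for the example, not figures from Goldberger.

```python
import numpy as np

def average_slope_se(n, reps=1000, seed=1):
    """Average conventional standard error of the slope across synthetic samples of size n."""
    rng = np.random.default_rng(seed)
    ses = []
    for _ in range(reps):
        x = rng.normal(size=n)
        y = 1.0 + 2.0 * x + rng.normal(size=n)   # hypothetical true model
        X = np.column_stack([np.ones(n), x])
        XtX_inv = np.linalg.inv(X.T @ X)
        beta = XtX_inv @ X.T @ y
        resid = y - X @ beta
        sigma2 = resid @ resid / (n - 2)         # two estimated parameters
        ses.append(np.sqrt(sigma2 * XtX_inv[1, 1]))
    return float(np.mean(ses))

se_small = average_slope_se(n=5)
se_large = average_slope_se(n=500)
print("average slope SE with n=5:  ", se_small)
print("average slope SE with n=500:", se_large)
```

With only five observations the reported standard error is many times larger than with five hundred. The regression output is already telling you how little the data can say, which is exactly the behavior you want.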

Conversely, of course, if your independent variables are highly correlated, or your number of observations is small, and you still get strong statistical results, this shows that you have a good reason to believe your conclusion is true. Standard statistical methods have already adjusted for these problems; if you get meaningful answers anyway, you've got nothing to apologize for.


COMMENTS (14 to date)
Radford Neal writes:

The critique by Rummel suffers from at least two statistical misconceptions. First, as quoted in the post above, he says

If two independent variables are highly correlated they are no longer statistically independent, and the first one entered into the regression, in this case economic freedom, steals that part of the correlation it has with democracy from the dependent variable. Thus, economic freedom is highly significant, while democracy is not.

But for a simple regression model, the order of the variables is irrelevant. The first doesn't "steal" the correlation from the second - the standard errors for both are simply increased.

He goes on to express a second misconception:

(An important statistical point about his use of significance tests -- he is not analyzing a sample, but the universe of cases -- thus standard significance tests are irrelevant).

This is totally wrong. We already know what wars were fought in the past. The whole purpose of the study is to predict what might happen in the future. The future cases are NOT in the available sample. Significance tests are entirely appropriate.

I haven't read the original study, so I can't say whether it was competently done, but Rummel's critique certainly provides no grounds for doubting its validity.
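The first point in the comment above, that the order of the variables is irrelevant in an ordinary least squares fit, is easy to check numerically. The sketch below uses synthetic data; the helper `ols`, the coefficients, and the 0.9 correlation between `x1` and `x2` are assumptions for illustration and have nothing to do with Gartzke's actual data.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients and their conventional standard errors."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)
    return beta, np.sqrt(sigma2 * np.diag(XtX_inv))

rng = np.random.default_rng(0)
n = 300
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + np.sqrt(1 - 0.9**2) * rng.normal(size=n)   # correlated with x1
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

# Same regressors, entered in two different orders.
X_a = np.column_stack([np.ones(n), x1, x2])
X_b = np.column_stack([np.ones(n), x2, x1])

beta_a, se_a = ols(X_a, y)
beta_b, se_b = ols(X_b, y)

print("x1 first:", beta_a, se_a)
print("x2 first:", beta_b, se_b)
```

Each variable gets the same coefficient and the same standard error whichever column it occupies; only the position in the output changes.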


Robert Book writes:
If you don't have a lot of data, then again your standard errors tend to be large. As well they should be! If you have three data points, you should be uncertain of your results.

Yes, but if you have only two data points, you should be uncertain of your results, but you'll appear to be very certain! (In a problem with one independent variable.)

Actually, now that I think about it, if you have three data points and a problem with two independent variables, you'll appear to be very certain also, though you should not be!

JAV writes:

I'm puzzled by this post. I'll take a simplified example to highlight my confusion. It's been a while since I did stats so pardon me if I make elementary errors.

Let aD + bE = P
Suppose E = 2*D (Perfect correlation)

One possible solution (when P = 101 and D = 1) would be (a = 1, b = 50); another would be (a = 99, b = 1), and so on.

So as you put it, the effects of different variables on the output are highly uncertain. Gartzke is claiming Economic freedom as more significant than Democracy. I feel one should really be hesitant to draw that conclusion. One could easily draw erroneous conclusions about the effects the different variables have.
How do standard stat methods already adjust for this?

PS: If economic freedom and democracy are highly correlated, then how could economic freedom be 50 times more effective than democracy?
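The perfectly correlated case in this comment can be written out directly. The sketch below extends the example to four hypothetical values of D (an assumption; the comment uses only D = 1) and checks that the two proposed solutions are indistinguishable from the data.

```python
import numpy as np

D = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical values of D
E = 2.0 * D                          # perfect correlation, as in the comment
X = np.column_stack([D, E])

# The two candidate coefficient pairs (a, b) from the comment.
pred_1 = X @ np.array([1.0, 50.0])
pred_2 = X @ np.array([99.0, 1.0])

rank = np.linalg.matrix_rank(X)

print(pred_1)   # [101. 202. 303. 404.]
print(pred_2)   # [101. 202. 303. 404.]
print(rank)     # 1
```

Both coefficient pairs produce identical predictions, and the design matrix has rank 1 rather than 2, so least squares has no unique answer here. With exact collinearity the usual formulas break down instead of returning a confident result. With merely high correlation the fit goes through, but the standard errors are large, which is the sense in which the method already accounts for the problem.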

jsmith99 writes:

Robert Book writes,

Yes, but if you have only two data points, you should be uncertain of your results, but you'll appear to be very certain! (In a problem with one independent variable.)

That's very true.

If there are only two data points for a model with a constant term and an independent variable, you have zero degrees of freedom. Ordinary least squares regression will therefore give you not only a perfect fit, but also infinitely large standard errors.

You'll appear to be very certain because you see a perfect fit, but you should be uncertain of your results because the standard errors are infinitely large.
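A two-point example shows where the false certainty comes from. The data values below are made up for illustration.

```python
import numpy as np

x = np.array([1.0, 3.0])             # two hypothetical observations
y = np.array([2.0, 7.0])
X = np.column_stack([np.ones(2), x]) # constant term plus one regressor

beta = np.linalg.solve(X, y)         # two equations, two unknowns: exact solution
resid = y - X @ beta
df = X.shape[0] - X.shape[1]         # observations minus estimated parameters

print("intercept, slope:", beta)     # [-0.5  2.5]
print("residuals:", resid)
print("degrees of freedom:", df)     # 0
```

The line passes exactly through both points, so the residuals are zero, but there are no degrees of freedom left. The usual variance estimate divides a zero residual sum of squares by zero degrees of freedom, so the perfect fit carries no information about precision.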

dsquared writes:

Peter Kennedy's "Guide to Econometrics" is also very good on this subject. It points out that, with a small number of exceptions, multicollinearity is a feature of the data, not of the model, and to claim that "the data are biased" brings into sharp relief how silly most critiques of multicollinearity are. (He doesn't use the example, but John Lott's work on the 2000 Florida elections, where he more or less intentionally constructed a model so as to be collinear, would be one of the exceptions.)

Aaron Chalfin writes:

It all depends on the context. If you are using the regression model primarily to predict y, there will not be any bias (though due to high standard errors the prediction may not be very good.)

On the other hand, if you are using the model to determine the relative impacts of X1 vs. X2, you may have a problem.

Roger McKinney writes:

My experience with regression comes from the quality control field, where you must have independent predictor variables in order to determine which one causes the defect. We achieved independence by designing experiments. But if we had to use historical data instead, we would solve the problem of multicollinearity by using factor analysis or partial least squares, both of which combine the data into fewer, but independent, predictors. It works quite well. I've done something similar with the Heritage Foundation's Freedom Index and found that all of their variables fit nicely into just two orthogonal factors.

Roger McKinney writes:

One more thought: My experience has been that with correlated predictors, the coefficients change, sometimes even the sign, as you enter and remove different predictors. So if they're as highly correlated as are economic freedom and democracy, which coefficients are correct? In addition to Rummel's suggestion of using factor analysis, I would think that the Cato authors would have at least tried step-wise regression, which often retains the strongest predictor and drops the weaker ones.

Aaron Chalfin writes:

This is what I was getting at. Cato is trying to use regression to directly compare the strength of various predictors (economic freedom & democracy). So they do not want to drop either variable from the regression.

As much as I love to apply regression to almost any set of data I can find, I think this might be a time when a series of thoughtful case analyses would be more convincing than regression.

Jon writes:

Rummel seems to be stale on multilinear regression. He proposes orthogonalization as a "cure". This does nothing to address the question. If two variables a and b are highly correlated, orthogonalization would just replace them by something like a+b and a-b. This does nothing to resolve the cause and effect issue of whether a or b is more significant.

There are other problems with the Cato study, though. First, does lack of economic freedom cause war, is it a result of the same factors that cause war, or does war cause a loss of economic freedom? Second, "economic freedom" covers a myriad of items. Some may prevent war and some may make it more likely. For example, leading up to war, a government may impose higher taxes and enact rationing. There may also be mandatory conscription and restrictions on imports or exports.

Roger McKinney writes:

Rummel isn't stale; he's stating a fact: When two predictors are highly correlated, you can't compare the strengths of their effects. My experience has been that regression will take the slightly stronger effect and give it a large coefficient, then, as if to compensate, give the slightly weaker effect a very low coefficient or even an opposite sign. If Cato would leave out economic freedom, they would find the coefficient of democracy increasing dramatically. The best one can say in this situation is that the two go together and it's impossible to separate the impact of one from the other.
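The jump in a coefficient when a correlated predictor is dropped can be reproduced on synthetic data. In the sketch below, the assumption that only `x1` truly affects `y`, the 0.9 correlation, and the sample size are all chosen for illustration; nothing here is a claim about which variable matters in the Cato data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + np.sqrt(1 - 0.9**2) * rng.normal(size=n)  # correlated with x1
y = 1.0 * x1 + 0.0 * x2 + rng.normal(size=n)              # assumed: only x1 matters

# Regression with both predictors.
X_long = np.column_stack([np.ones(n), x1, x2])
beta_long = np.linalg.lstsq(X_long, y, rcond=None)[0]

# Regression with x1 left out.
X_short = np.column_stack([np.ones(n), x2])
beta_short = np.linalg.lstsq(X_short, y, rcond=None)[0]

print("x2 coefficient, both predictors included:", beta_long[2])
print("x2 coefficient, x1 omitted:              ", beta_short[1])
```

In this setup the coefficient on `x2` is near zero when both predictors are included and rises to roughly 0.9 once `x1` is omitted, because `x2` then picks up the effect of the variable it is correlated with. A bivariate regression that looks highly significant is therefore not, by itself, evidence against the multivariate result.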

Aaron Chalfin writes:

What Cato could have tried is a test to see if economic freedom is either a mediator or a moderator of democracy or vice versa.

My own guess is that democracy explains why there is a relationship between economic freedom and violence/warfare and is therefore a partial mediating variable in the analysis. In other words, the relationship between economic freedom and warfare is made stronger by the inclusion of democracy.

Bill Stepp writes:

Rummel also thinks that no democracies ever fought a war against each other, which is contradicted by both the Civil War (or War of Southern Independence/Northern Aggression) and WW One.
Democracy, as Mencken pointed out, is an advance auction sale of stolen property, so it's far removed from economic freedom, even if freedom is better secured under it than under totalitarianism.

Rudy Rummel writes:

RE: your comments on my use of the term multicollinearity, see my Monday blog (9/25/05) for a "Little Primer on Multicollinearity" at: http://freedomspeace.blogspot.com/ This is not to say you should read such a primer, but simply to bring it to your attention.
