If you take a look at the Census, education appears to be extremely lucrative. Back in 1975, drop-outs earned about 20% less than high school grads, college grads earned over 50% more than high school grads, and holders of advanced degrees earned over 100% more than high school grads. Nowadays, the differences are stronger still: drop-outs earn over one-third less than high school grads, college grads earn 83% more than high school grads, and holders of advanced degrees earn almost three times as much as high school grads.

Unlike many labor economists, I freely admit that an appreciable fraction of these wage gaps is actually caused by pre-existing ability. Education hardly deserves full credit for the observed education premia. Lately, though, I've been thinking about a data problem that potentially leads us to *under*state the full effect of education: measurement error.

The trouble is that the Census is based on self-reports. While self-reports are far from worthless, they're also far from perfect. Imagine, then, that 10% of high school graduates check the "college grad" box, and 10% of college grads incorrectly check the "high school" box. What happens?

Suppose the average high school grad earns $25,000/year, and the average college grad earns $75,000 per year. That's a 200% college premium. The Census, however, will tabulate their average earnings to be .9*$25,000+.1*$75,000=$30,000 for high school grads, and .9*$75,000+.1*$25,000=$70,000 for college grads. That's a mere 133% college premium.
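The arithmetic above can be checked in a few lines of Python (the 10% misreporting rate is the hypothetical from the text, not an estimate):

```python
# Mixing 10% of each group into the other attenuates the observed
# premium: the reported group means move toward each other.
hs_true, coll_true = 25_000, 75_000
misreport = 0.10  # hypothetical misclassification rate from the text

hs_reported = (1 - misreport) * hs_true + misreport * coll_true    # $30,000
coll_reported = (1 - misreport) * coll_true + misreport * hs_true  # $70,000

true_premium = coll_true / hs_true - 1                # 2.00, i.e. 200%
observed_premium = coll_reported / hs_reported - 1    # ~1.33, i.e. 133%
print(f"observed premium: {observed_premium:.0%}")
```

Note that a symmetric 10% error rate cuts the measured premium by a third, even though 90% of people answered correctly.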

Statisticians have a standard way to correct for problems like these. Just measure the reliability of your suspect variable, then apply the appropriate correction. For example, education has a reliability of about .9. If you ignore this measurement error when you estimate the education premium in the General Social Survey, controlling for a short IQ test (WORDSUM) and age, you get the following results. (Logrealinc is the log of family income.)

------------------------------------------------------------------------------
  logrealinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0970101   .0023115    41.97   0.000     .0924794    .1015407
     wordsum |   .0715542   .0032552    21.98   0.000     .0651739    .0779345
         age |  -.0002933   .0003638    -0.81   0.420    -.0010063    .0004198
       _cons |   8.280534   .0335161   247.06   0.000      8.21484    8.346227
------------------------------------------------------------------------------

Long story short: Ignoring measurement error, the education premium is 9.7% per year of education.
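The logic of the reliability correction can be sketched with simulated data (the numbers below are illustrative, not the GSS). In a one-regressor model, classical measurement error attenuates the naive slope toward reliability times the true slope, so dividing by the reliability undoes the bias. (In the multivariate regressions here the correction runs through the whole covariance matrix, which is why the corrected coefficient reported below is not simply .097/.9.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta = 0.097       # illustrative "true" premium per year of education
reliability = 0.9  # reliability of self-reported education, as in the text

true_educ = rng.normal(14, 3, n)  # latent years of education
# Add reporting noise so that Var(true) / Var(observed) = reliability.
err_sd = np.sqrt(np.var(true_educ) * (1 - reliability) / reliability)
obs_educ = true_educ + rng.normal(0, err_sd, n)
log_inc = beta * true_educ + rng.normal(0, 0.8, n)

# Naive OLS slope on the noisy regressor is attenuated toward r * beta.
naive = np.cov(obs_educ, log_inc)[0, 1] / np.var(obs_educ)
corrected = naive / reliability  # classical attenuation correction
print(naive, corrected)
```

With reliability .9 the attenuation is mild; a variable measured with reliability .7 would lose roughly 30% of its slope.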

If you correct for measurement error in education, however, the education coefficient goes up to 11.3%:

------------------------------------------------------------------------------
  logrealinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .1125098   .0026644    42.23   0.000     .1072873    .1177322
     wordsum |   .0606613   .0033714    17.99   0.000     .0540531    .0672695
         age |   .0002703   .0003649     0.74   0.459    -.0004449    .0009855
       _cons |    8.12053   .0361073   224.90   0.000     8.049757    8.191303
------------------------------------------------------------------------------

If you're a cheerleader for education, you'll seize on these results to argue that standard estimates consistently understate the education premium. There's just one problem with this reaction:

*ALL variables have measurement error!* When you correct for only one form of measurement error, ignoring all the others, you stack the deck in favor of the variable you fix. Look at what would have happened if we corrected the original results for IQ's measurement error (reliability=.74), ignoring all other data problems:

------------------------------------------------------------------------------
  logrealinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0835483   .0026555    31.46   0.000     .0783434    .0887533
     wordsum |    .109542   .0049556    22.10   0.000     .0998287    .1192552
         age |  -.0009268   .0003671    -2.52   0.012    -.0016464   -.0002072
       _cons |   8.253405   .0334372   246.83   0.000     8.187866    8.318944
------------------------------------------------------------------------------

If we correct for mismeasurement of IQ, and ignore mismeasurement of education, the estimated education premium actually falls to 8.4%. At the same time, the measured effect of IQ jumps from .07 (one more question right on the ten-question test boosts income by 7%) to .11 (one more question right on the ten-question test boosts income by 11%).

If you really take measurement error seriously, you have to correct for *all* forms of measurement error. When you do, it's entirely possible for the measured effect of education not to rise. Look at what happens if we simultaneously correct for measurement error of education and IQ:

------------------------------------------------------------------------------
  logrealinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0989883   .0031315    31.61   0.000     .0928503    .1051263
     wordsum |   .0948702   .0052509    18.07   0.000     .0845781    .1051623
         age |  -.0003461    .000371    -0.93   0.351    -.0010733    .0003812
       _cons |   8.116265   .0359477   225.78   0.000     8.045805    8.186724
------------------------------------------------------------------------------

Correcting for both forms of measurement error sharply inflates the estimated effect of IQ, but the education premium is virtually identical to the naive estimate that ignores measurement error entirely.
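The mechanics of the joint correction can be sketched with simulated data (parameters here are illustrative; only the reliabilities, .9 and .74, come from the text). One standard moment-based errors-in-variables approach, similar in spirit to Stata's eivreg, subtracts each regressor's measurement-error variance from the diagonal of the observed covariance matrix before solving for the coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
b_educ, b_iq = 0.10, 0.10  # illustrative true effects, equal by assumption

# Correlated latent regressors standing in for education and IQ.
true_educ = rng.normal(0, 1, n)
true_iq = 0.5 * true_educ + np.sqrt(0.75) * rng.normal(0, 1, n)
y = b_educ * true_educ + b_iq * true_iq + rng.normal(0, 1, n)

def add_noise(x, reliability, rng):
    # Var(error) chosen so that Var(true) / Var(observed) = reliability.
    err_var = np.var(x) * (1 - reliability) / reliability
    return x + rng.normal(0, np.sqrt(err_var), len(x))

X = np.column_stack([add_noise(true_educ, 0.90, rng),
                     add_noise(true_iq, 0.74, rng)])
Xc = X - X.mean(axis=0)
yc = y - y.mean()
S_xx = Xc.T @ Xc / n  # covariance of the *observed* regressors
S_xy = Xc.T @ yc / n

naive = np.linalg.solve(S_xx, S_xy)
# Subtract each regressor's error variance, (1 - r) * Var(observed),
# from the diagonal before solving.
rel = np.array([0.90, 0.74])
corrected = np.linalg.solve(S_xx - np.diag((1 - rel) * np.diag(S_xx)), S_xy)
print(naive, corrected)
```

With these parameters the noisier IQ measure is attenuated much more than education, and some of IQ's lost explanatory power leaks into the naive education coefficient, mirroring the pattern in the tables above: fixing both reliabilities inflates the IQ coefficient sharply while leaving education roughly where it started.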

What would happen if we had a typical long list of control variables, each and every one corrected for measurement error? It's entirely reasonable to expect the education premium to *fall*. After all, measurement error corrections have larger effects for variables measured with *low* reliability, and most variables are less reliably measured than education.

HT: Steve Miller, who ran the STATA errors-in-variables regressions for me just days before the birth of his fifth child.

It is also true that measurement error in earnings is non-classical. High earners tend to under-report, and low earners tend to over-report. This suggests the actual college-high school earnings gap is larger than typically measured.

I doubt that:

"Long story short: Ignoring measurement error, the education premium is 9.7% per year of education."

Even if it did, this does not tell the story of those without jobs, those stuck in less demanding jobs, or those not working with the skills and knowledge that they acquired.

Besides, what I want to see is the actual distribution curve -- across age, SES, education -- based not on some survey, but real data. Cohorts manage to earn more or less -- not based on education -- but other things, like recessions, and SES.

I have doubts that 10% of people who complete the census forms check the wrong education box. But, that's not very important. The biggest problem with census-derived data is selection bias. Many people (such as myself) refuse to complete census forms except for the minimum information required by law: number of persons in household. People who complete census forms have different personalities than those who do not. Those differences almost certainly do not have the same distributions in each educational category or income range. If one does not know the form completion rates for each category and income range, then one cannot generate valid conclusions.

I personally would like to see these sorts of analysis compared to "percentile standing of educational achievement". That is, how much of the gap is because people who didn't finish high school are earning less, and how much is because people who didn't finish high school used to be say 45% of the working population and they are now say 20% of the working population?

(I have no idea how to get at the relevant data - anybody know someone who could be commissioned to pursue this?)

A related question is - in past decades, many people didn't finish high school due to economic or political circumstance. My sense is that in the current era, people who don't finish high school are more likely obstructed by other sorts of issues. Put another way, not finishing high school in 1952 might well have very different implications about a person than not finishing in 2013.

Whoa, a libertarian running Stata regressions. #ThingsYouDontSeeEveryDay