Bryan Caplan  

Embarrass Me Now, Please

I'm now writing the most number-crunching parts of The Case Against Education: one chapter on education's selfish return, another on its social return.  The work is grueling, and haunted by the fear that I've made an early mistake that invalidates all my efforts.  Several true friends have spent hours in my office watching me walk through my work.  But I'm still nervous.  I don't want to toil for years only to be severely embarrassed when an eagle eye like Justin Wolfers exposes a critical error.

So I'm turning to you, EconLog readers, for help.  I'd rather release a flawed draft and endure mild embarrassment now than release a flawed book and endure deep embarrassment in 2017.  So embarrass me now, please.  Here's an Excel spreadsheet on which all my calculations depend.

What am I doing?  Snapping together the following pieces:

1. 2011 Census numbers on average income for full-time, full-year workers, broken down by high school dropouts, high school only, B.A.s, and advanced degree holders.  I assume that the average dropout finished 10th grade, then quit school.

2. 2000-2013 St. Louis Fed numbers on unemployment rates, broken down by high school dropouts, high school only, B.A.s, and M.A.s.  As in (1), I assume that the average dropout finished 10th grade, then quit school.

3. The punchline from my ability bias literature review, which concludes that 45% ability bias is reasonable.

4. The punchline from my sheepskin effect literature review, which concludes that, in percentage terms, the last year of high school has 3.4 times the effect of one regular high school year, the last year of college has 6.7 times the effect of one regular college year, and the last year of an advanced degree has 6.2 times the effect of a regular advanced degree year.

5. The CBO's numbers for private sector benefits as a fraction of pre-tax income, by education class.

6. Four student archetypes:

a. The Excellent Student, who has the raw ability of the typical advanced degree holder. 

b. The Good Student, who has the raw ability of the typical B.A. without an advanced degree.

c. The Fair Student, who has the raw ability of the typical high school grad who never went to college.

d. The Poor Student, who has the raw ability of the typical high school dropout.

By construction, a person whose educational attainment and raw ability "match" gets the observed income and unemployment rate for people with his educational attainment.  If you're a Poor Student who finished 10th grade, then stopped, you get the observed averages for dropouts: $21,107 income and 9.8% unemployment rate.  If you're a Good Student who finished a B.A. but not an advanced degree, you get the observed averages for B.A.s: $59,415 income and 3.4% unemployment.

However, if your educational attainment and raw ability don't match, I use (1) through (5) to compute your outcomes.  While an Excellent Student who drops out of high school earns much less than the typical advanced degree holder, he earns much more than the typical high school dropout thanks to his high raw ability.

7. Last, the spreadsheet calculates alternate "naive" estimates that set ability bias to 0%.
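For readers who'd rather see the pieces snapped together in code than in cells, here is a rough sketch of the mismatch adjustment described in (1)-(6). The 45% ability-bias figure and the two observed incomes are the ones quoted in this post; the combination rule below is a simplified stand-in for illustration, not the spreadsheet's actual formulas.

```python
# Illustrative sketch only: the 45% ability-bias figure and the observed
# incomes come from the post; the way they are combined here is a
# simplified stand-in, not the spreadsheet's actual formulas.

ABILITY_BIAS = 0.45   # share of the raw education premium attributed to ability

# Observed average income by attainment (2011 Census figures quoted above)
OBSERVED_INCOME = {
    "dropout": 21107,
    "ba": 59415,
}

def adjusted_income(attainment, ability_match_attainment):
    """Income for a student whose raw ability 'matches' one attainment
    level but who actually completed another.

    If attainment and ability match, return the observed average.
    Otherwise, shrink the observed gap by the ability-bias share:
    only (1 - ABILITY_BIAS) of the premium is caused by schooling.
    """
    observed = OBSERVED_INCOME[attainment]
    matched = OBSERVED_INCOME[ability_match_attainment]
    if attainment == ability_match_attainment:
        return observed
    # The causal part of the gap between the two observed averages
    causal_gap = (observed - matched) * (1 - ABILITY_BIAS)
    return matched + causal_gap

# A Good Student (B.A.-level ability) who drops out after 10th grade
# earns more than the observed dropout average, per point 6 of the post.
print(round(adjusted_income("dropout", "ba")))
```

The sketch lands the mismatched student between the two observed averages, which is the qualitative behavior the post describes.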

I'm happy to clarify my work in the comments or via email.  All feedback is welcome, but I deeply appreciate the revelation of demonstrable errors.  If you find any, bless you for embarrassing me now when the cost is mild.

P.S. Using Excel does not count as "demonstrable error."  It's my comparative advantage.







COMMENTS (27 to date)
DanR writes:

Named ranges? Ewww...

John Hall writes:

@DanR I use named ranges in excel all the time. For long formulas, I'm thinking like conditional vlookups or something, it can definitely make the document less ugly and easier to update if changes need to get made.

That being said, the lesson of Rogoff/Reinhart and Piketty is to not use Excel in the first place for serious work if you can avoid it. Get/hire a grad student to convert whatever you've done into an R file. The peace of mind will save you in the long run.

Massimo writes:

Excel?!? For any semi serious analysis you should use R or python/pandas or something similar. Excel is for business types that can't use a shell.

Arthur B. writes:

Background: I use Excel every day and have done so for a decade. I once built a small ray tracer with it (pure Excel, no VBA) just because I could. So I'm making the following judgment based on an intimate understanding of it.

1/ do NOT use excel for this. It might be your comparative advantage for producing *an* output, but it isn't for producing *a correct* output.

2/ use R or Python.

3/ consider that the more likely reason everyone is telling you the same thing is that they are correct, not that they are purists who overestimate your ability.

4/ the spreadsheet is unreadable. What should be comments (right click, insert comment) is put in the middle of the sheet, preventing column auto resizing

5/ missing values should be NA() not blank cells

6/ use a column computing the last valid value to compute your ratios rather than hardcode the position of the cells where you have data

7/ the thing is unreadable

8/ 6.2, 3.4 and 6.7 are hardcoded in some formulas. Do not hard code *any* constant in a formula

9/ that was just the first page, I have no idea what you're doing. See, the problem with spreadsheets is their opacity. It makes it really hard to catch mistakes. Your bad practices as an Excel user make it worse.

10/ this looks like a very simple model, do use R or Python.
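[Editor's note: point 8 above is standard practice in any language. In a script, the same idea looks like naming each sheepskin multiplier once at the top instead of repeating literals inside formulas. The multipliers are the figures quoted in the post; the helper function is purely illustrative.]

```python
# Name every constant once; never repeat a literal inside a formula.
# (Multipliers are the sheepskin figures quoted in the post.)
SHEEPSKIN_HS = 3.4    # last year of high school vs. a regular HS year
SHEEPSKIN_BA = 6.7    # last year of college vs. a regular college year
SHEEPSKIN_ADV = 6.2   # last year of an advanced degree vs. a regular year

def degree_year_weights(regular_years, final_multiplier):
    """Relative weight of each year of a degree: ordinary years count as 1,
    the final (sheepskin) year counts as the multiplier."""
    return [1.0] * (regular_years - 1) + [final_multiplier]

# Four-year B.A.: three regular years plus one sheepskin year.
print(degree_year_weights(4, SHEEPSKIN_BA))   # [1.0, 1.0, 1.0, 6.7]
```

If a multiplier later changes, it changes in one place, and every formula that uses it updates automatically — the same benefit a named-constants table gives in Excel.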

DanR writes:

@John Agree to disagree. What they add in readability, they subtract in easy auditability. Bryan's use of them (names shared across workbook but values change between sheets, unintuitive names {eg "ohsu"}, etc.) makes this a harder document to understand. But this isn't strictly a flaw, so whatever.

Looking at the file, I haven't found any errors, but I do have a few things that would help going forward.

1) I think the NaiveMeta tab has a mislabeled title. Column D "% Income Gain Adjusted for 45% Ability Bias" isn't actually adjusted by that amount, which is confusing.

2) The adjustment percentage, as well as the documented sheepskin effects noted below the Meta tables are hard coded into the cell calculations themselves. (cell E6, for example) It is generally poor practice to use constants in formulas. It would be better to define these in a table and reference them that way.

3) It would be more helpful to use a coloring scheme to denote which cells were calculations, constants, and references to other sheets. In financial services, this is typically done with black, blue, and green fonts, respectively.

Nick writes:

I hate to repeat what everyone else is saying but while excel is great for prototyping it is pretty tough to check/audit a production model. A proper coding language is much better.

Fortunately, your comparative advantage is not in Excel, but rather in teaching economics. Find a couple of grad students (preferably comp sci or physics, to make sure they can get the coding right) and pay them $15-20/hr to code this up. Have them work independently and make sure that they get the same results (and the same as the spreadsheet). For about $200 your problem will be solved.
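[Editor's note: the two-independent-implementations check Nick describes is easy to automate once both versions exist. A minimal harness just compares their outputs on the same inputs; the two functions below are hypothetical stand-ins, not the spreadsheet's real formulas.]

```python
# Minimal cross-check harness: run two independent implementations of the
# same calculation and flag any disagreement. Both functions here are
# hypothetical stand-ins for the spreadsheet and a reimplementation.

def premium_v1(income, baseline, ability_bias):
    # First implementation: scale the raw premium directly.
    return (income - baseline) * (1 - ability_bias)

def premium_v2(income, baseline, ability_bias):
    # Second implementation: algebraically equivalent, written differently.
    return (income - baseline) - ability_bias * (income - baseline)

cases = [(59415, 21107, 0.45), (87981, 59415, 0.45), (30000, 30000, 0.45)]
for income, baseline, bias in cases:
    a = premium_v1(income, baseline, bias)
    b = premium_v2(income, baseline, bias)
    assert abs(a - b) < 1e-9, f"disagreement on {(income, baseline, bias)}: {a} vs {b}"
print("all cases agree")
```

A disagreement on any input pinpoints exactly where the two implementations diverge, which is the whole value of the exercise.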

Ross Levatter writes:

2017???!!!

Bryan Caplan, the George R. R. Martin of economics...

:-)

A.B. writes:

I should add that the time it would take any decent grad student to program this would be well under one hour. Have him sit next to you and do it as you explain the spreadsheet to him.

Steve J writes:

The "R" crowd needs to do a better job of making their case. Saying Excel is bad does not prove that R is good. By switching to R the number of people who can use the data is greatly reduced. What makes R so much better than Excel?

Cameron writes:

Dude, I'll buy you a SAS or similar software if you promise not to use Excel. Even the current version is 1990s technology.

Pat writes:

Please stop hectoring him about Excel. If you don't want to help, don't. Yeesh.

Massimo writes:

@Steve, with R, you keep your data in a .csv file, which I'd argue is much more widely accessible than a proprietary Excel file, and your transformations would be in an .r file. Why is it better? The quick version is because transformations and calculations are expressed in succinct/concise code which is easy to read and reproduce, rather than buried in a GUI, which is dumbed down for point-and-click philistines.
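[Editor's note: the data/code split Massimo describes doesn't require R specifically; Python's standard library gives the same separation. The file layout and column names below are invented for illustration.]

```python
# Illustration of the data/code split: raw figures live in a plain CSV that
# anyone can inspect, and the transformation is a short, readable script.
# The columns and values are invented for this example.
import csv
import io

raw_csv = """education,income,unemployment
dropout,21107,9.8
ba,59415,3.4
"""

# In practice this would be: with open("income.csv") as f: ...
rows = list(csv.DictReader(io.StringIO(raw_csv)))
by_education = {r["education"]: float(r["income"]) for r in rows}

# The transformation is one visible line, not a formula hidden in a cell.
ba_premium = by_education["ba"] / by_education["dropout"]
print(f"B.A. income is {ba_premium:.2f}x dropout income")
```

Because the data file and the script are separate artifacts, each can be audited on its own: check the CSV against the Census tables, then check the code against the model.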

Finch writes:

FWIW, Excel itself is audited to a much higher standard than R. But code is easier to verify than Excel. It depends on where you think the errors are.

And +1 for Arthur B.'s comment that the spreadsheet is unreadable. At least give it to someone who has experience working in Excel to clean it up first.

Finch writes:

If you want a quick check that will improve your confidence in correctness, explain what you are doing to a student and have them implement it in SAS or Stata or R. Don't let them look at your Excel. Then hash out whatever differences arise. This shouldn't take more than a couple of hours.

Arthur B. writes:

Steve,

There are very few people who can do a good job of auditing the spreadsheet, because it is difficult to audit. You do not immediately see the formulas, it's hard to spot whether the inputs in a formula are correct, you have to go back and forth, etc.

There might be more people who have a basic command of Excel than R or Python, but there are far more people who can do a proper audit of the formulas in Python and R than people who can do a proper audit of the spreadsheet.

The same spreadsheet written as a Python script would probably represent fewer than two dozen lines of code. The variables would have expressive names. The data would be held in separate files, thus making it easy to separately check the correctness of the raw data and the correctness of the code. You could also generate and try out new models much faster.

Robert S. writes:

I don't see any data or formulaic errors at this stage nor is the Excel argument important. However, I'd have conceptual concerns.

The current inputs you have now, basically "market income," are by no means ideal. Unfortunately, it is not clear you will ever be able to obtain the raw data necessary to run a lot of important correlations. In my experience, sources like the US Census, other government data, or private higher-education studies do not have raw data broken down far enough; at best you will find studies with samples from the population.

In addition, there are very strong a priori reasons and justifications from other research to think that on this particular topic the data you are working with demonstrates heteroskedasticity, and everything you propose so far is nowhere close to accounting for that.

For instance, the ability to marry higher earners might result in different returns to different educational choices for a subset of females compared to other females and males.

A given critic will likely be able to find several such arguments which, at the current stage and with the entirety of the models you are considering, you won't be able to respond to. Regardless of the flawlessness of the data inputs, software code, whatever, critics will readily dismiss your overall conclusions if your argument does not attempt to control for clear demographic factors like sex.

That doesn't make the effort not worthwhile, but it might be helpful to have extensive brainstorming on many possible unaccounted for factors and lines of criticism. Consider putting aside various parts of a schedule up through publication for this. Then concerns could be mentioned honestly and proposed as avenues for further explanation in advance, rather than being blindsided by critics at some point.

Linda Seebach writes:

I'd look carefully at the assumption that the average dropout completes 10th grade. The paper I used to work for, the Rocky Mountain News in Denver, did a big project following every student who entered ninth grade through to their theoretical graduation year, and found that most of the eventual dropouts never passed enough ninth-grade courses to enter 10th grade. Unfortunately the paper closed, so its archives have disappeared, but this must have been studied elsewhere.

Tom writes:

I'm just interested - why the general preference for R/Python over Stata? At least in my masters program, it's all about Stata (or Eviews for time series)

As I understand it, R is much trickier to use than Stata.

Is it just because R is free? (Don't know if Python is)

Finch writes:

> As I understand it, R is much trickier to use than Stata.

As I understand it, the budget allocated to proving Stata (or SAS) does what it says it does is much larger than the budget allocated to doing the same for R. So important projects often ground rule out R.

But R is a more modern programming language than Stata or SAS. That can mean "trickier," but it can also mean that it's more expressive and allows "better" programs.

Mark V Anderson writes:

I've been an accountant for the last 34 years, and have used Excel for the last 20 years or so. I guess I'm one of those business types who can't use a shell (whatever that is).

I have no idea what R or Python is. Can someone explain how these other programs have some kind of self-correcting feature so that the errors of Rogoff/Reinhart can be avoided? It seems to me in any program one needs to carefully check the formulas at every step to make sure they are right.

As has been said, the worksheet is hard for outsiders like us to understand. Prof. Caplan would get much better commentary if he made the worksheet simpler. He should at least have a table somewhere that lists all the named ranges, with a definition and derivation. Also, I agree that hard-coding constants in formulas is a bad idea, since that is asking for mistakes when changes are made.

Luca Ravioli writes:

[Comment removed for supplying false email address. Email the webmaster at econlib.org to request restoring your comment privileges. A valid email address is required to post comments on EconLog and EconTalk.]

powerR useR writes:

Proper R code with vectorization and *apply* functions is not definitively easier than Excel for auditing.

Matt writes:

Why do you care? Just conclude that inequality is bad and you'll win a nobel prize.

Nick writes:

@poweR useR:
As I tell the developers who work for me, premature optimization is the root of all evil. There is nothing "improper" about using simple loops in R. The right process would be to get the code working first. Then if performance is an issue, you benchmark. If benchmarking indicates a loop that is slowing you down, then you vectorize. But at that point you have initial results that you can regression test against to ensure continuity and correctness.

In a broader sense, that is the heart of the answer to everyone who is wondering why a coding language is better than Excel for this sort of thing. Software development has learned over many years how to test and produce quality code (even if those lessons aren't always followed). Within that domain, Bryan's issue is a solved problem. With a spreadsheet tool like Excel, it is not. And in general, when there exists a standard way to do something, one should not be so quick to roll your own solution.

David writes:

I don't see any obvious mistakes, but I don't understand the relationship between Meta column D (income gain adjusted for ability bias) and E (income gain adjusted for ability bias and sheepskin).

Is column D used anywhere? Did it go through some off-spreadsheet calculation to produce column E?

Eric Hammer writes:

As others have said, a grad student using Stata would be a good idea. At least then you would have blocks of code to demonstrate what you did with the data in case people ask, which Excel does not offer. No doubt people will demand replicability that is much easier to provide with Stata and pals.

In fact I believe you happen to know a local student who is doing work on just such a topic, and is also good at Stata. You might even remember his name when you see him this weekend :)

Michael Crone writes:

1. On the 'Meta' sheet, the entry in G4 is 9.8, while the note below (in cell A17) indicates it should be 9.9. A similar inconsistency exists on the 'NaiveMeta' page.

2. I'm not sure what you're doing with the student archetypes, but there may be a logical problem (not a programming one). If the advanced degree students earn an average of $87,981, then some earn more and some earn less. Are you modeling this group as having some Excellent Students (who earn on average $87,981 - cell B15 on that page) and some in each of the other categories (who all earn less)? This is not mathematically possible. If any group earns less than average and has any advanced degrees at all, then some group has to earn more than average. (A similar argument could be made about the high school dropouts. Some student archetype should earn less than average.) Or do I understand the logic wrong?
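[Editor's note: Michael Crone's second point is the weighted-average constraint. A short check, with subgroup means and weights invented for illustration, shows why no mix of subgroups that all sit at or below the group mean — with at least one strictly below — can average back up to the group mean.]

```python
# Why no subgroup mix can average to $87,981 if every subgroup averages
# at or below $87,981 and at least one averages strictly below it.
# Subgroup means and weights are invented for illustration; only the
# $87,981 group mean comes from the spreadsheet discussion above.
group_mean = 87981
subgroup_means = [87981, 59415, 40000]   # one at the mean, two below
weights = [0.6, 0.3, 0.1]                # shares of advanced-degree holders

blended = sum(w * m for w, m in zip(weights, subgroup_means))
print(blended < group_mean)   # True: the blend must fall below the group mean
```

However the weights are chosen, the blended mean is a convex combination of the subgroup means, so it can only equal $87,981 if every subgroup with positive weight sits exactly at $87,981.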

Comments for this entry have been closed