David R. Henderson  

Educational Testing


In a comment on my post, "Home Schooling and Socialization," Scott writes:

Your views are very similar to those of John Taylor Gatto. Are you familiar with his work? Most libertarians I know consider him a hero, which is interesting given his vociferous criticism of IQ testing.
I am somewhat familiar with Gatto and I quoted him in my chapter on schooling and education in The Joy of Freedom: An Economist's Odyssey. Parenthetically, I wish school reformers, who include some of my colleagues at Hoover, would go in his direction rather than try to turn the Socialist Ship of State Schooling a few degrees.

Now to the testing issue. I'm not familiar enough with IQ testing to answer, but I reprint here my retelling, in the aforementioned chapter, of a story about educational testing told in 1995 by Charles Johnsen. Here's what I wrote:

The next story comes from an educational tester who, for reasons you will quickly understand, is no longer an educational tester. His name was Charles Johnsen, and when he wrote this story in 1995, he was a Lutheran preacher and a computer chip designer in Aurora, Colorado. Twenty years earlier, Johnsen worked on a proposed test for a testing firm in Chicago that had a contract with the Chicago Board of Education. One of the questions on the test was about the "el," the word used locally for Chicago's system of elevated trains. Johnsen put a photocopy of an actual el schedule in the test and asked a question like the following: "CJ has a job interview at 9:00 a.m. The company is a block away from the State Street station. What is the last train CJ can take to get to the interview on time?" Then he listed five different train times from the schedule, only one of them right.

The results surprised him. Students in the suburban schools did well on the other items but, in his words, "blew the item about the el big time." But the kids in the inner-city schools generally got the right answer on the el question even when, as was obvious from the rest of the test, they could barely read.

That was the first surprise. Then came his bigger surprise. His employer's software generated statistics about each test question. A "good" question did not discriminate between ignorance and knowledge, but instead discriminated between "good" students and "poor" students. No matter how important the question, if the obviously good students, the ones in the suburbs, got the answer wrong and the obviously bad students, the ones in the inner cities, got the answer right, the question was thrown out. I think of this story whenever I hear people say that the SAT does not discriminate against black people.


COMMENTS (18 to date)
OneEyedMan writes:

When studying for the SATs, I remember having to memorize the names for groups of all sorts of animals, because at the time the SAT writers loved to ask rural-themed questions. Now I know why.

michael writes:

The SAT was rewritten about 6 years ago to adjust for the bias you write about above.

The new test reduced the score differences between racial groups but lowered the predictive power of the test. That is, it wasn't able to predict college GPA as well as the earlier version.

I spent a few years as an SAT tutor. It doesn't really test a student's knowledge of math or English, per se. It really captures how much time he spent studying for the test.

It isn't designed to be something fair that demonstrates how much knowledge is in a student's head; it's designed to predict college performance. A question that increases the score of students who can barely read clearly reduces the predictive power of the test. I think the new version is worse.

Eric Falkenstein writes:

His anecdote is almost surely exaggerated. It's as common as the anecdote about the test question with the word 'regatta' that was scuttled in the 1970s. Many have argued that knowledge is relevance-based, so poor or black kids will have different intellectual skills tuned to their parochial environments.

Tests like the SAT are constructed using questions that predict the success of other questions. Questions on which a particular demographic does disproportionately poorly, given overall score, are removed. That is, if the average black student who gets a 600 on the SAT tends to miss a question the average white 600 student tends to guess correctly, it is removed. It's called Differential Item Functioning (DIF). It's applied to the major groups people focus on (male/female, black/white). If none of the questions show DIF between various demographics, it's implausible to say the test is biased against those groups.
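[Editor's note: the DIF procedure Falkenstein describes can be sketched in a few lines. This is a toy simulation with made-up numbers, not real test data: two groups share the same ability distribution, nineteen items depend on ability alone, and one extra item (standing in for the "el" question) is easier for one group. Matching students on their score over the ordinary items and then comparing correct-rates on the suspect item is the core of a Mantel-Haenszel-style DIF check.]

```python
import math
import random

random.seed(0)

N_ITEMS = 19  # ordinary items that depend on ability alone

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def simulate_student(group):
    # Both groups draw ability from the same distribution.
    ability = random.gauss(0, 1)
    answers = [random.random() < sigmoid(ability) for _ in range(N_ITEMS)]
    # One extra item whose difficulty depends on group membership
    # (standing in for familiarity with the "el" schedule).
    shift = 1.0 if group == "inner_city" else -1.0
    answers.append(random.random() < sigmoid(ability + shift))
    return group, answers

students = [simulate_student(g)
            for g in ("suburban", "inner_city") for _ in range(5000)]

def biased_item_rate(group, lo, hi):
    """Correct-rate on the last item among students whose score on the
    ordinary items falls in [lo, hi] -- i.e., matched on overall ability."""
    picked = [ans[-1] for g, ans in students
              if g == group and lo <= sum(ans[:-1]) <= hi]
    return sum(picked) / len(picked)

r_sub = biased_item_rate("suburban", 8, 11)
r_inn = biased_item_rate("inner_city", 8, 11)
print(f"matched on total score: suburban {r_sub:.2f}, inner-city {r_inn:.2f}")
```

Because the two matched groups differ sharply on this one item, a DIF screen would flag it and remove it, exactly as Falkenstein describes.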

So, a good SAT test question is one that tends to correlate well with how students do on other SAT test questions. Now, what's the point of that? Generally such tests are g-loaded, and g is the psychometric term for IQ. Surely you look at grades and test scores when evaluating students; they are essential complements. Standardized tests are not the only thing, but like height in a basketball player, more is better on average, and especially helpful when you have 1000 applicants and 100 openings.

If anyone could create a test that passed DIF testing and eliminated, if not reversed, the omnipresent achievement gap, you could make literally hundreds of millions of dollars licensing it to school districts. Most school districts are primarily focused on this problem in the context of No Child Left Behind and would love to solve it merely by using a different test. But the test has to be validated, which involves DIF criteria among others. So if the preacher was right, mazel tov, he's found gold; dig him up and have him give you more information ASAP. But I'm skeptical.

Mike writes:

Twenty years ago I had a part-time job teaching SAT test prep. At that time, and I bet it's still true today, you could eliminate any answer choice containing anything that would be considered disparaging to women and minorities.

Years ago a wise lady from the ETS said: "We can't measure what's important, so we measure what we can." This remark stuck with me.

Ever notice that whenever people venture to assess school performance, they use standardized tests of reading vocabulary and math, two easily quantifiable subjects? Seldom do people defend the proposition that any conclusions from these measures generalize.

Radford Neal writes:

Tests like the SAT are constructed using questions that predict the success of other questions.

So the perfect test would be one in which half the students got every question right and the other half got every question wrong - regardless of what the questions were about? I think something is missing here, like usefulness at predicting what the test is supposed to assess (college performance, for the SAT?), which need have nothing to do with whether one question predicts another.

Given the difficulty of really assessing predictive usefulness, using some degree of common sense seems desirable. For the question on the train schedule, common sense says that the suburban kids who couldn't answer it really were lacking something that the kids who could answer it had.

Tracy W writes:

My mum was trained in writing test questions as part of her teacher training in NZ. According to her, it's very easy to write a test question with unexpected meanings that the smart kids in the class identify, and thus get the answer "wrong", which is something the test setter needs to look out for.
Being a teacher, mum was working on the scale of an individual class.

I do remember at school sometimes getting a test paper back and discovering a question was marked wrong, and on querying it being told that while my answer was right, "Don't think too much about the question" (absolutely useless advice, by the way).

William Barghest writes:

"common sense says that the suburban kids who couldn't answer it really were lacking something that the kids who could answer it had."

Familiarity with reading the "el" schedule.

As I understand it, the ability to answer test questions of any sort is correlated: if you are likely to answer one type of question correctly, you are likely to answer questions of other types correctly as well. How far you are from the mean of general question-answering ability is all that IQ is. The idea of an IQ test is to find a subset of questions that predicts your general question-answering ability without having to ask you millions of questions. Some questions are more predictive than others, since there can be differences in what things you are exposed to. The inner-city kids, who were on average worse at answering test questions of all types, did particularly well at reading "el" schedules relative to the suburban kids, who on average answered more questions correctly but had little "el" reading experience. So this question would not predict the results of a broader set of test questions, and its exclusion doesn't mean that the test is biased or that general question-answering ability doesn't exist.

Chris writes:

So the perfect test would be one in which half the students got every question right, and the other half got every question wrong - regardless of what the questions were about? I think something is missing here.

A test has internal validity if all of its questions correlate with each other. Internal validity suggests that all the questions are noisy measurements of a single quantity, and thus the law of large numbers can be used to combine them into a single score.

A test has external validity if its score is a useful predictor of something else.

Falkenstein was explaining one test of internal validity, and why the question being discussed failed that test. That doesn't mean tests of external validity are ignored.
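[Editor's note: Chris's internal-validity point, that many noisy measurements of one quantity can be averaged into a much less noisy score, can be illustrated with a toy simulation. All numbers are made up for illustration.]

```python
import math
import random

random.seed(0)

def corr(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

# One underlying trait per student; every "question" observes it through
# heavy noise (noise sd twice the trait sd).
traits = [random.gauss(0, 1) for _ in range(2000)]

def noisy_item(t):
    return t + random.gauss(0, 2)

one_item = [noisy_item(t) for t in traits]
avg_of_50 = [sum(noisy_item(t) for _ in range(50)) / 50 for t in traits]

print(f"single item vs. trait:     r = {corr(one_item, traits):.2f}")
print(f"50-item average vs. trait: r = {corr(avg_of_50, traits):.2f}")
```

A single item correlates only weakly with the trait, while the 50-item average tracks it closely: the noise shrinks as the number of items grows, which is exactly the law-of-large-numbers argument.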

Dan Weber writes:
I do remember at school sometimes getting a test paper back and discovering a question was marked wrong, and on querying it being told that while my answer was right, "Don't think too much about the question" (absolutely useless advice, by the way).

Modern exams for adults work like this, too. The CISSP exam (for computer security) doesn't have questions that are carefully written with the answers carefully considered -- any schmoe can write questions, and an answer is considered "good" if enough testers give the same answer. Which leads to real problems for people who know the subject well but are trying to predict how the "average" test-taker would answer.

Paavo Ojala writes:

So do you think that the testing firm didn't have any way to measure validity other than the assumption that suburban kids should obviously score higher?

Forget these silly anecdotes. If there was a $20 bill on the ground, somebody would have already picked it up. And in this case a whole bunch of very observant and greedy people have already checked many, many times. They have written books about the lack of the twensky bill. Or is it that psychologists and the whole educational establishment are so unbelievably racist or so absolutely stupid that they've never thought of this kind of bias in mental testing?

Radford Neal writes:

A test has internal validity if all of its questions correlate with each other. Internal validity suggests that all the questions are noisy measurements of a single quantity, and thus the law of large numbers can be used to combine them into a single score.

Why would one think that college performance depends on only a single quantity? At a minimum, I'd think it depends on IQ, knowledge, and motivation.

For a given number of questions, one clearly gets the most information from a test if questions are NOT predictive of other questions. So that would be the ideal, though it's perhaps not attainable while keeping the questions relevant to the objective of predicting college performance, and allowing for randomly wrong answers.

For the el schedule question, the real issue isn't to do with the inner-city students who performed poorly on the other questions - they're not getting into college anyway. The real issue is whether the question is predictive of college performance among students whose performance on the rest of the test is such that answering the el question correctly may make the difference in whether they get in or not. It seems quite likely to me that the ability to read a train schedule is not a good indicator of the ability to read the kind of material encountered in an academic context.
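[Editor's note: Radford Neal's argument, that a test whose items all measure one attribute predicts a multi-attribute outcome worse than a test spreading its items across attributes, can be checked with a toy simulation. The attributes, item counts, and noise levels below are all invented for illustration.]

```python
import math
import random

random.seed(0)

def corr(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

# Each student has two independent attributes (say, IQ and motivation);
# the outcome we want to predict depends on both equally.
students = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(3000)]
outcome = [a + b for a, b in students]

def item(attr):
    return attr + random.gauss(0, 1)  # noisy measurement of one attribute

# Test A: ten items all measuring attribute 1 (highly inter-correlated items).
score_a = [sum(item(a) for _ in range(10)) for a, b in students]
# Test B: five items per attribute (two item clusters, weakly correlated).
score_b = [sum(item(a) for _ in range(5)) + sum(item(b) for _ in range(5))
           for a, b in students]

print(f"redundant test vs. outcome: r = {corr(score_a, outcome):.2f}")
print(f"diverse test vs. outcome:   r = {corr(score_b, outcome):.2f}")
```

Under these assumptions the diverse test predicts the two-attribute outcome noticeably better than the redundant one, even though its items correlate less with each other.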

JTBS writes:

Anyone who has spent any time riding the el would seriously doubt the veracity of this tale. CTA publishes a timetable, but no one I have ever met relies upon it. It exists in an alternate realm where there is a 9:02 train from the Belmont red line station and trains run in evenly spaced increments throughout the day.

If an experienced el rider really wanted to get to a job appointment on time by taking the el, he would first ride the el the day before at approximately the same time of day to see how long it might take. Once he had figured this out, he would then add 45 minutes to the time to make sure that he would not be late for his appointment. Even then, our hapless job seeker would prepare for disappointment.

This is not a joke. I lived in Chicago for three years during which I rode the el almost every day. Never once did I refer to the timetable. No one else I knew did either. And these were all professional people with tight schedules. I seriously doubt that a group of CPS students would have any experience reading timetables either.

Dale writes:

Radford Neal wrote:

For a given number of questions, one clearly gets the most information from a test if questions are NOT predictive of other questions.

Only if you believe that there should be no correlation between how a person who is good at math answers two similar math questions.

Such correlation is a necessity of any reasonable model of the test.

Consider that the test questions do not explain someone's intelligence; rather, someone's intelligence explains the test questions.

It's like doing a multiple regression with one independent variable and many dependent variables. Dependent variables without covariance between them cannot be explained by the independent variable. I.e., if we have an independent variable of "how good you are at math" and we expect it to increase the answers to both question 1 and question 2, then question 1 and question 2 must have covariance.

This is compounded by the fact that we can't actually measure intelligence, and so have nothing hard to compare the values to.

Thus the test of correlation is the only test we can do to ensure that the questions actually measure something.

Intelligence is our independent variable and the test questions are the dependent variables; measuring the covariance between the dependent variables should let us back out (to some degree) the independent variable.

It's not precise, but if the test questions are uncorrelated then we can make no predictions about the independent variable.

You are right that if we expected the answers to the questions to explain how smart someone was and not the other way around, and we didn't believe that there was covariance between the questions, and we only had a limited number of questions to ask, then uncorrelated questions would be the best test. But that model is wrong, and generalities involving the size of standard errors have no bearing if the model is incorrectly specified.
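[Editor's note: Dale's regression picture can be simulated directly: items driven by a common latent variable necessarily covary, while an item with zero loading on the latent is uncorrelated with it and carries no information about it. A toy sketch with made-up loadings and noise levels.]

```python
import math
import random

random.seed(0)

def corr(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

# "How good you are at math" is the latent independent variable; each
# question is a dependent variable driven by it plus noise.
latent = [random.gauss(0, 1) for _ in range(3000)]
q1 = [x + random.gauss(0, 1) for x in latent]  # loads on the latent
q2 = [x + random.gauss(0, 1) for x in latent]  # loads on the latent
q3 = [random.gauss(0, 1) for _ in latent]      # zero loading: pure noise

# Items driven by the same latent variable must covary...
print(f"r(q1, q2)     = {corr(q1, q2):.2f}")
# ...while an item uncorrelated with the latent tells us nothing about it.
print(f"r(q3, latent) = {corr(q3, latent):.2f}")
```

In this setup q1 and q2 correlate with each other purely because both load on the latent variable, and q3's lack of correlation with them is exactly the signal that it should be dropped from the test, as Dale argues.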

Dale writes:

Sorry for the double post; I forgot to add to this:

"I.e., if we have an independent variable of "how good you are at math" and we expect it to increase the answers to both question 1 and question 2, then question 1 and question 2 must have covariance."

If question 1 and question 2 do not have covariance, then either question 1 or question 2 is not explained by the variable we want to measure and should be dropped from the test.

Radford Neal writes:

Dale: Yes, as I said, in practice, the best test isn't going to have independent responses to questions.

That's because we can't come up with a single perfect question to assess, say, IQ. If nothing else, it's possible that someone with low IQ will correctly guess the answer, or someone with high IQ will misread the question. So we need multiple questions that are trying to measure the same thing.

That doesn't mean that they should ALL be trying to measure the same thing, however, if the objective is to predict college performance, and that depends on more than one mental attribute. And if one finds that a new question that seems intuitively to be getting at something relevant is uncorrelated (or negatively correlated) with the others, one shouldn't just throw it out. It may be that it's getting at something relevant that is missed by all the other questions!

Dale writes:

They don't all measure the same thing. It's why we have tests for different aptitudes. The idea that you want uncorrelated questions on a test is just foolish. Ideally you want to test for one single thing per test, and in order to do that you need to have correlation between the questions. If you don't, then you've done something wrong.

Saying "we should do more tests than just the SAT" is not nearly the same thing as what you claimed, which was that uncorrelated questions would provide a better assessment. This is unequivocally untrue: the more uncorrelated the answers (assuming we don't have perfect correlation), the less information we have about any particular measure.

Brittany Catamount writes:

Seeing as I had to deal with state tests growing up, I think they were completely unfair, but not because of the ridiculous questions that some of them asked. I will admit that at times it did seem like these tests were weeding out the less knowledgeable students. This article also helps me understand how the tests are graded. I'm still not too sure how I feel about them, though. I feel like tests based on general information are not fair to anyone in general. Students should have time to prepare and not have to go over everything. Luckily, there are classes now that help prepare for these tests, and many students have found them helpful.
