February 13, 2009

Alfie Kohn and the Murray Gell-Mann Amnesia effect

(the introduction can be found here)

In the comments of the recent Willingham-Kohn dust-up, edu-blogger Stuart Buck brought up DI, and Kohn responded by citing this article of his, which immediately reminded me of the Murray Gell-Mann Amnesia effect.

The late Michael Crichton once gave a speech describing what he termed the Murray Gell-Mann Amnesia effect.

Briefly stated, the Gell-Mann Amnesia effect is as follows. You open the newspaper to an article on some subject you know well... You read the article and see the journalist has absolutely no understanding of either the facts or the issues. Often, the article is so wrong it actually presents the story backward—reversing cause and effect. I call these the "wet streets cause rain" stories. Paper's full of them.

In any case, you read with exasperation or amusement the multiple errors in a story, and then turn the page to national or international affairs, and read as if the rest of the newspaper was somehow more accurate about Palestine than the baloney you just read. You turn the page, and forget what you know.

I know enough about the research on DI to know that Kohn's description of the DI research qualifies as one of the worst hatchet jobs in education policy advocacy. As such, it should serve as evidence that Alfie Kohn might not be a trustworthy source on education policy and that his analysis of education research should be closely scrutinized in order to stave off the Murray Gell-Mann Amnesia effect.

But don't take my word for it; let's review Kohn's description of the DI research.

After we get past an initial paragraph of over-heated inflammatory language, Kohn's first argument, related to the results of Project Follow Through (PFT), is:

Of course, even if these results could be taken at face value, we don’t have any basis for assuming that the model would work for anyone other than disadvantaged children of primary school age.

But PFT included schools in middle-class neighborhoods, and middle-class kids took part and were evaluated as part of the research. In fact, a very diverse set of students was evaluated.

The DI model was the best performing model for "disadvantaged children" as Kohn acknowledges, but it was also the best performing model for high-performing students, White children, Native Americans, African-American students, Hispanic students, English language learners, urban children, rural children, and the very lowest of the disadvantaged children. So there is a basis, a good basis, for assuming that the DI model works for most primary school aged children. And, in fact, these results have been replicated numerous times in subsequent studies which Kohn also fails to acknowledge.

Kohn's next argument is a repetition of the "variability" argument that others have levied against the PFT results:

To begin with, the primary research analysts wrote that the “clearest finding” of Follow Through was not the superiority of any one style of teaching but the fact that “each model’s performance varies widely from site to site.”[1] In fact, the variation in results from one location to the next of a given model of instruction was greater than the variation between one model and the next. That means the site that kids happened to attend was a better predictor of how well they learned than was the style of teaching (skills-based, child-centered, or whatever).

This is a spurious conclusion because most of the variability is attributable to the inclusion in the analysis of two cohorts from the Grand Rapids site, which had severed ties with the DI sponsor well before the end of the study. This is well documented. The Grand Rapids site was the only "DI site" with low performance (a half standard deviation below the mean of the other DI sites) and the only "DI site" that consistently fell below national norms. In fact, most of the variability among the remaining DI sites lies above national norms.

In addition, subsequent research, involving researchers with at least some professional/reputational ties to DI, has shown that the variability between sites is mostly attributable to demographic factors and experimental error, and not to the DI program.

We disagree with both Abt and House et al. in that we do not find variability among sites to be so great that it overshadows variability among models. It appears that a large part of the variability observed by Abt and House et al. was due to demographic factors and experimental error. Once this variability is brought under control, it becomes evident that differences between models are quite large in relation to the unexplained variability within models.
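To make concrete what "bringing this variability under control" means, here is a minimal sketch in Python using invented site-level numbers (these are not the actual PFT data, and the variable names are hypothetical): the raw between-site variance in achievement shrinks sharply once a demographic factor is regressed out, leaving the residual variance the quote refers to.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical site-level data (illustration only, not the PFT numbers):
# each site has a mean achievement score and a demographic index,
# e.g. the proportion of students from low-income households.
n_sites = 12
poverty = rng.uniform(0.2, 0.9, size=n_sites)
# Achievement driven largely by demographics, plus site-level noise
achievement = 60 - 25 * poverty + rng.normal(0, 2, size=n_sites)

raw_var = achievement.var(ddof=1)

# Regress out the demographic factor and examine the leftover variance
X = np.column_stack([np.ones(n_sites), poverty])
beta, *_ = np.linalg.lstsq(X, achievement, rcond=None)
residual_var = (achievement - X @ beta).var(ddof=1)

print(f"between-site variance, raw:        {raw_var:.1f}")
print(f"after demographic adjustment:      {residual_var:.1f}")
```

In this toy setup most of the apparent site-to-site spread is demographic, so once it is controlled, differences between models would stand out against a much smaller unexplained variance, which is the researchers' point.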

In any event, even if the variability finding was characterized as the main finding by the primary researchers, this in no way diminishes the finding that DI was the superior performing program across the board for all measures tested for all groups tested. It's still a valid finding and has been upheld by numerous researchers examining the findings since the initial evaluation.

So, best performing program and the only program whose variability was mostly above national norms does not a valid criticism make. Strike two for Kohn.

Next, Kohn attacks the testing instruments used in PFT:

Second, the primary measure of success used in the study was a standardized multiple-choice test of basic skills called the Metropolitan Achievement Test. [(MAT)]

The MAT is not just a test of basic skills such as Listening for Sound (sound-symbol relationships), Word Knowledge (vocabulary words), Word Analysis (word identification), Mathematics Computation (math calculations), Spelling, and Language (punctuation, capitalization, and word usage).

It is also a test of cognitive skills. Several Metropolitan subtests measure indirect cognitive consequences of learning: the Reading subtest (in effect, paragraph comprehension), the Mathematics Problem-Solving subtest, and the Mathematics Concepts subtest (knowledge of math principles and relationships).

This is important because Kohn goes on to claim:

While children were also given other cognitive and psychological assessments, these measures were so poorly chosen as to be virtually worthless.

Even if the other cognitive and psychological assessments were "poorly chosen," it does not diminish the fact that the MAT is a well respected test of both basic and cognitive/conceptual skills, as acknowledged by subsequent researchers. The other cognitive/conceptual skills test used was the Raven's Colored Progressive Matrices, but it did not prove to discriminate between models or show change in scores over time.

Also, the affective skills were assessed using two instruments: the Intellectual Achievement Responsibility Scale (to assess whether children attribute their success (+) or failures (-) to themselves or external forces) and the Coopersmith Self-Esteem Inventory (to assess how children feel about themselves, the way they think other people feel about them, and their feelings about school).

Kohn buries the reason why he believes the cognitive and affective skills tests were "poorly chosen" in a footnote.

There is strong reason to doubt whether tests billed as measuring complex “cognitive, conceptual skills” really did so. Even the primary analysts conceded that “the measures on the cognitive and affective domains are much less appropriate” than is the main skills test (Stebbins et al., 35). A group of experts on experimental design commissioned to review the study went even further, stating that the project “amounts essentially to a comparative study of the effects of Follow Through models on the mechanics of reading, writing, and arithmetic” (House et al., 1978, p. 145). (This raises the interesting question of whether it is even possible to measure the conceptual understanding or cognitive sophistication of young children with a standardized test.)

Let's take these "strong reasons" in order. Even if the primary analysts believed that these tests were "much less appropriate," this doesn't mean the tests didn't measure what they purported to measure. There is no evidence that they didn't. It could also be that the primary researchers believed that the basic skills tests were best suited for measuring what K-3 students are typically expected to know. In any event, the opinion that the tests were "much less appropriate" does not lead one to conclude that the tests didn't measure "complex 'cognitive, conceptual skills.'" This is an empirical question, and Kohn provides no empirical support for his conclusion.

Next, Kohn relies on the opinions of the "experts" commissioned and funded by the Ford Foundation (which I'll get to later) for the proposition that PFT measured only the effects on "the mechanics of reading, writing, and arithmetic." Apparently, what these experts were getting at was that students who hadn't learned the mechanics of reading, writing, and doing arithmetic might not be able to demonstrate their cognitive skills. This is also an empirical question. But these experts provided no empirical support for their conclusion. Other researchers, however, did look into the question once it was raised.

Conceivably, certain models-let us say those that avowedly emphasize "cognitive" objectives-are doing a superior job of teaching the more cognitive aspects of reading and mathematics, but the effects are being obscured by the fact that performance on the appropriate subtests depends on mechanical proficiency as well as on higher-level cognitive capabilities. If so, these hidden effects might be revealed by using performance on the more "mechanical" subtests as covariates.

This we did. Model differences in Reading (comprehension) performance were examined, including Word Knowledge as a covariate. Differences in Mathematics Problem Solving were examined, including Mathematics Computation among the covariates. In both cases the analyses of covariance revealed no significant differences among models. This is not a surprising result, given the high correlation among Metropolitan subtests. Taking out the variance due to one subtest leaves little variance in another. Yet it was not a forgone conclusion that the results would be negative. If the models that proclaimed cognitive objectives actually achieved those objectives, it would be reasonable to expect those achievements to show up in our analyses.
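The covariance analysis the researchers describe can be sketched in a few lines of Python. The data below are made up for illustration (these are not the PFT scores, and the variable names are hypothetical): a "cognitive" subtest is fit with and without a model indicator after covarying out a "mechanical" subtest, and the two fits are compared with an F statistic for the model effect.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data for two models (illustration only): each student has
# a "mechanical" subtest score (think Word Knowledge) and a "cognitive"
# subtest score (think Reading comprehension) that tracks it closely.
n = 200
group = np.repeat([0.0, 1.0], n)                 # 0 = Model A, 1 = Model B
mechanical = rng.normal(50, 10, size=2 * n)
# Cognitive score driven almost entirely by mechanical proficiency + noise,
# with no true model effect built in.
cognitive = 5 + 0.9 * mechanical + rng.normal(0, 3, size=2 * n)

def rss(X, y):
    """Residual sum of squares from an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

ones = np.ones_like(cognitive)
# Reduced model: intercept + mechanical covariate only
rss_reduced = rss(np.column_stack([ones, mechanical]), cognitive)
# Full model: intercept + covariate + model indicator
rss_full = rss(np.column_stack([ones, mechanical, group]), cognitive)

# F statistic for the model effect after the covariance adjustment
df_full = 2 * n - 3
F = (rss_reduced - rss_full) / (rss_full / df_full)
print(f"F statistic for model effect after adjustment: {F:.2f}")
```

A near-zero F here mirrors the researchers' negative result: once mechanical proficiency is covaried out, no significant model differences remain. Had the "cognitive" models actually produced hidden cognitive gains, the full model would fit markedly better and F would be large.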

So, again, we have no valid reason for discounting the results of the non-basic skills tests. Unsupported opinion is not a valid reason, last I checked.

Lastly, Kohn raises a question:

This raises the interesting question of whether it is even possible to measure the conceptual understanding or cognitive sophistication of young children with a standardized test.

And then conspicuously fails to answer it. This is the poor man's version of debate. Moreover, nothing that precedes this "interesting question" is capable of actually raising it. Also, I'm not sure if Kohn is trying to claim that you can't measure these skills at all or that you can't measure them with a standardized test. Either way, Kohn provides no support for either claim.

Kohn's innuendo is that the students might have had conceptual understanding or cognitive sophistication that we must accept on faith, despite the evidence that more students in the non-DI models were incapable of demonstrating these magical, immeasurable skills on simple tests of paragraph comprehension, mathematical problem-solving, and knowledge of math principles and relationships, not to mention all the other "basic skills" that were measured.

Unfortunately, Kohn fails on all three counts to provide evidence that would compel a reader to follow him down his opinionated path that PFT only measured basic skills and that the other measures were "virtually worthless." Maybe this is why he buried this one in a footnote.

Next Kohn claims:

Some of the nontraditional educators involved in the study weren’t informed that their programs were going to end up being judged on this basis.

First of all, the DI educators were just as non-traditional as the other models' educators. DI is about as far removed from traditional pedagogy as the other models are.

Also, even if some of the other educators claimed that they were never initially told that their models were going to be judged on reading comprehension, math problem solving, and the like, they would have quickly learned what was coming down the pike, since the PFT students were extensively tested throughout the study. And it was the third and fourth cohorts that formed the cohorts of the evaluation. Whoever claims not to have known initially would certainly have found out during the time the first two cohorts passed through.

Next Kohn claims:

The Direct Instruction teachers methodically prepared their students to succeed on a skills test and, to some extent at least, it worked.

Actually, the DI model systematically prepared its students to read, understand the conventions of language, spell, and do arithmetic, with an emphasis "placed on the children's learning intelligent behavior rather than specific pieces of information by rote memorization." And the students outperformed the other students on tests of sound-symbol relationships, vocabulary words, word identification, math calculations, spelling, punctuation, capitalization, word usage, paragraph comprehension, mathematics problem-solving, knowledge of math principles and relationships, and the affective measures. There was no evidence that the DI students engaged in test preparation as alluded to by Kohn.

PFT demonstrated, once again, that teaching these skills directly was more effective than teaching them obliquely, which is what the other models believed would lead to superior performance. It turns out they were wrong, and they continue to be wrong to this day.

For those keeping track at home, Kohn has now failed to establish the first two prongs of his argument. He has one more prong left which I'll take up in my next post.


Anonymous said...

Interesting read. I appreciate the change in tone, and, although I'm not convinced about the whole "Murray Gell-Mann Amnesia effect" thing, I'm definitely more likely to look at Kohn with a more skeptical eye.

Anonymous said...

It isn't just Kohn who is afflicted with Gell-Mann Amnesia. The malady is pandemic across EdLand.

In his autobiographical history of Follow Through, Zig presents persuasive evidence that the consequences were all about politics and nothing about the results. So it is today!

DI was not the only casualty in the Reading War. Jeanne Chall modestly declared "Mission Accomplished" in 1967.


"With beginning reading instruction now on the national agenda, the Carnegie Corporation funded a study that Chall conducted from 1962 to 1965. She reviewed the existing research, described methods of instruction, interviewed leading proponents of various methods, and analyzed two leading reading series of the late 1950s and early 1960s. The results appeared in her Learning to Read: The Great Debate (1967).
Chall identified what she called "the conventional wisdom" of reading instruction: that children should read for meaning from the start, use context and picture clues to identify words after learning about fifty words as sight words, and induce letter–sound correspondences from these words. Like Flesch, she concluded that this conventional wisdom was not supported by the research, which found phonics superior to whole word instruction and "systematic" phonics superior to "intrinsic" phonics instruction. She also found that beginning reading was different in kind from mature reading–a conclusion that she reaffirmed in her Stages of Reading Development (1983), which found that children first learn to read and then read to learn. She recommended in 1967 that publishers switch to a code-emphasis approach in children's readers, which would lead to better results without compromising children's comprehension."

That was over 40, yes FORTY friggin YEARS ago. While "phonics" found favor in the late 1990's, the favor was again overshadowed by politics. Whole Language just put on the mask of Balanced Literacy and continues to thrive.

Once again, Alfie has a point here and there. It's not reasonable to attribute DI variability within schools and districts and among districts to "demographic factors and experimental error." Demographic factors and experimental error aren't causal. DI is causal.

And without trying to sort out the details, some of what Kohn has to say about the tests involved can be substantiated.

So if he cares to Kohn could well counter punch, Ken. And the deRosa-Kohn skirmish will continue as a small part of the larger Wars.

Anonymous said...

There are even more pertinent points in the Crichton speech you flagged, Ken:

“Endless presentation of conflict may interfere with genuine issue resolution. There is evidence that the television foodfights not only don't represent the views of most people-who are not so polarized-but may tend to make resolution of actual disputes more difficult in the real world. At the very least, they obscure the recognition that we resolve disputes every day. Compromise is much easier from relatively central positions than it is from extreme and hostile, conflicting positions: Greenpeace vs the Logging Industry.”
Your indirect exchanges with Kohn may not be a foodfight, but they seem pretty close to me. More Crichton:

“Let me point to a demonstrable bad effect of the assumption that nothing is really knowable. Whole word reading was introduced by the education schools of the country without, to my knowledge, any testing of the efficacy of the new method. It was simply put in place. Generations of teachers were indoctrinated in its methods. As a result, the US has one of the highest illiteracy rates in the industrialized world. The assumption that nothing can be known with certainty does in truth have terrible consequences.

Yep, Kohn and many others could well read that and take it to heart. If Crichton can see things like this, you'd think Kohn et al. could see them. More:

“As GK Chesterton said (in a somewhat different context), "If you believe in nothing, you'll believe in anything." That's what we see today. People believe in anything."
Actually, when it comes to education, people believe worse than nothing. Most believe authorities who annually proclaim, "We're making gains." And arbitrarily set cut scores on ungrounded tests are treated as degrees of "proficiency." More:

“. . . since we're awash in this contemporary ocean of speculation, we forget that things can be known with certainty, and that we need not live in a fearful world of interminable unsupported opinion. But the gulf that separates hard fact from speculation is by now so unfamiliar that most people can't comprehend it.”
Go figure.

KDeRosa said...

Once again, Alfie has a point here and there.

No doubt. These social science experiments are always messy.

But, and here's the important point, Kohn's points do not add up to hard evidence that there is no empirical support for DI.

It's not reasonable to attribute DI variability within schools and districts and among districts to "demographic factors and experimental error." Demographic factors and experimental error aren't causal. DI is causal.

Why not? If half the population were black and the other half northeast Asian, do you really think there would be no variability between the two groups on, say, a math exam?

And without trying to sort out the details, some of what Kohn has to say about the tests involved can be substantiated.

Again, no test is perfect, but Kohn's criticisms amount to quibbles at the edges of reliability rather than hard evidence that there is no empirical support for DI, as he claims.

Dick Schutz said...

"If half the population were black and the other half northeast Asian, so you really think there will be no variability between both groups on, say, a math exam?"

That misses the point. Race is not causal in instruction. It's the characteristics of the instructional program that are causal.

Zig made the decision early on that scripting is "the only way to go" because of the difficulty experienced in "training teachers."
Scripting is "a way" but it's not the "only" way to go.

The story is told of BF Skinner. A student told him that a rat wasn't performing the way Skinner's textbook said it should. Skinner told him, "The rat is right. I was wrong. The rat is always right."

That doesn't imply that you always have to go with teachers. It does imply that it's not reasonable to treat variability as "user error."

"Kohn's criticisms amount to quibbles at the edges of reliability rather than hard evidence that there is no empirical support for DI, as he claims."

Fully agree.

Tracy W said...

Dick, you've evidently managed to annoy me enough that I am willing to devote hours to poking holes into your statements. Which is probably a good thing, overall.

You keep talking about "arbitrarily set cut scores on ungrounded tests are treated as degrees of "proficiency."
You don't say specifically which tests you are talking about. But out of curiosity and irritation I was driven into digging around in the NAEP, and those guys publish a lot of data on what the tests try to measure, the differences between different skill sets, etc. See http://nagb.org/publications/frameworks.htm

Now there always is an inherent aspect of arbitrariness in any achievement test. Why do we choose to teach some skills and not others, given the vast range of things that could be taught and be useful to at least one person, somewhere? But the NAEP tests at least appear to be based on consulting with experts and they have some explanation in their framework of why they chose to measure certain things.

I think you are being overly hasty in dismissing achievement tests as ungrounded.

KDeRosa said...

That misses the point. Race is not causal in instruction. It's the characteristics of the instructional program that are causal.

I'm not sure I understand, Dick. Are you saying that instructional programs should have the same effect on all groups of students regardless of innate and/or SES characteristics or their abilities coming into school?