In the comments of the recent Willingham-Kohn dust-up, edu-blogger Stuart Buck brought up DI, and Kohn responded by citing this article of his, which immediately reminded me of the Murray Gell-Mann Amnesia effect.
The late Michael Crichton once gave a speech describing what he termed the Murray Gell-Mann Amnesia effect.
Briefly stated, the Gell-Mann Amnesia effect is as follows. You open the newspaper to an article on some subject you know well... You read the article and see the journalist has absolutely no understanding of either the facts or the issues. Often, the article is so wrong it actually presents the story backward—reversing cause and effect. I call these the "wet streets cause rain" stories. Paper's full of them.
In any case, you read with exasperation or amusement the multiple errors in a story, and then turn the page to national or international affairs, and read as if the rest of the newspaper was somehow more accurate about Palestine than the baloney you just read. You turn the page, and forget what you know.
I know enough about the research on DI to know that Kohn's description of the DI research qualifies as one of the worst hatchet jobs in education policy advocacy. As such, it should serve as evidence that Alfie Kohn might not be a trustworthy source on education policy and that his analysis of education research should be closely scrutinized in order to stave off the Murray Gell-Mann Amnesia effect.
But don't take my word for it; let's review Kohn's description of the DI research.
After we get past an initial paragraph of overheated, inflammatory language, Kohn's first argument, related to the results of Project Follow Through (PFT), is:
Of course, even if these results could be taken at face value, we don’t have any basis for assuming that the model would work for anyone other than disadvantaged children of primary school age.
But PFT included schools in middle-class neighborhoods, and middle-class kids took part and were evaluated as part of the research. In fact, a very diverse set of students was evaluated.
The DI model was the best-performing model for "disadvantaged children," as Kohn acknowledges, but it was also the best-performing model for high-performing students, White students, Native American students, African-American students, Hispanic students, English language learners, urban children, rural children, and the very lowest of the disadvantaged children. So there is a basis, a good basis, for assuming that the DI model works for most primary-school-aged children. And, in fact, these results have been replicated numerous times in subsequent studies, which Kohn also fails to acknowledge.
Kohn's next argument is a repetition of the "variability" argument that others have levied against the PFT results:
To begin with, the primary research analysts wrote that the “clearest finding” of Follow Through was not the superiority of any one style of teaching but the fact that “each model’s performance varies widely from site to site.” In fact, the variation in results from one location to the next of a given model of instruction was greater than the variation between one model and the next. That means the site that kids happened to attend was a better predictor of how well they learned than was the style of teaching (skills-based, child-centered, or whatever).
This is a spurious conclusion, because most of the variability is attributable to the inclusion in the analysis of two cohorts from the Grand Rapids site, which had severed ties with the DI sponsor well before the end of the study. This is well documented. The Grand Rapids site was the only "DI site" with low performance (a half standard deviation below the mean of the other DI sites) and the only "DI site" that consistently fell below national norms. In fact, most of the variability among the remaining DI sites is above national norms.
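To see why a single outlying site matters so much to a variability argument, here is a toy sketch (with made-up effect sizes, not the actual PFT numbers) of how one low-scoring site can inflate the apparent site-to-site variance within a model:

```python
import statistics

# Hypothetical site-level effect sizes for one model (NOT the actual PFT data).
# Most sites sit comfortably above the norm; one site that severed ties with
# the sponsor mid-study scores far below the others.
typical_sites = [0.40, 0.50, 0.60, 0.45, 0.55]
outlier_site = -0.50  # roughly a full standard deviation below the others' mean

var_without = statistics.pvariance(typical_sites)
var_with = statistics.pvariance(typical_sites + [outlier_site])

print(f"site-to-site variance without the outlier: {var_without:.3f}")
print(f"site-to-site variance with the outlier:    {var_with:.3f}")
```

In this invented example, the single outlier multiplies the within-model variance more than twenty-fold, which is the shape of the argument for setting aside the Grand Rapids cohorts before comparing between-model and within-model variability.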
In addition, subsequent research, involving researchers with at least some professional/reputational ties to DI, has shown that the variability between sites is mostly attributable to demographic factors and experimental error, and not to the DI program.
We disagree with both Abt and House et al. in that we do not find variability among sites to be so great that it overshadows variability among models. It appears that a large part of the variability observed by Abt and House et al. was due to demographic factors and experimental error. Once this variability is brought under control, it becomes evident that differences between models are quite large in relation to the unexplained variability within models.
In any event, even if the variability finding were characterized as the main finding by the primary researchers, this in no way diminishes the finding that DI was the superior-performing program across the board, for all measures tested and for all groups tested. It's still a valid finding and has been upheld by numerous researchers examining the findings since the initial evaluation.
So, best performing program and the only program whose variability was mostly above national norms does not a valid criticism make. Strike two for Kohn.
Next, Kohn attacks the testing instruments used in PFT:
Second, the primary measure of success used in the study was a standardized multiple-choice test of basic skills called the Metropolitan Achievement Test. [(MAT)]
The MAT is not just a test of basic skills such as Listening for Sounds (sound-symbol relationships), Word Knowledge (vocabulary), Word Analysis (word identification), Mathematics Computation (math calculations), Spelling, and Language (punctuation, capitalization, and word usage).
It is also a test of cognitive skills. Several Metropolitan subtests measure indirect cognitive consequences of learning: the Reading subtest (which is, in effect, paragraph comprehension), the Mathematics Problem-Solving subtest, and the Mathematics Concepts subtest (knowledge of math principles and relationships).
This is important because Kohn goes on to claim:
While children were also given other cognitive and psychological assessments, these measures were so poorly chosen as to be virtually worthless.
Even if the other cognitive and psychological assessments were "poorly chosen," that does not diminish the fact that the MAT is a well-respected test of both basic and cognitive/conceptual skills, as acknowledged by subsequent researchers. The other cognitive/conceptual skills test used was the Raven's Colored Progressive Matrices, but it did not prove to discriminate between models or show change in scores over time.
Also, the affective skills were assessed using two instruments: the Intellectual Achievement Responsibility Scale (to assess whether children attribute their success (+) or failures (-) to themselves or external forces) and the Coopersmith Self-Esteem Inventory (to assess how children feel about themselves, the way they think other people feel about them, and their feelings about school).
Kohn buries the reason why he believes the cognitive and affective skills tests were "poorly chosen" in a footnote.
There is strong reason to doubt whether tests billed as measuring complex “cognitive, conceptual skills” really did so. Even the primary analysts conceded that “the measures on the cognitive and affective domains are much less appropriate” than is the main skills test (Stebbins et al., 35). A group of experts on experimental design commissioned to review the study went even further, stating that the project “amounts essentially to a comparative study of the effects of Follow Through models on the mechanics of reading, writing, and arithmetic” (House et al., 1978, p. 145). (This raises the interesting question of whether it is even possible to measure the conceptual understanding or cognitive sophistication of young children with a standardized test.)
Let's take these "strong reasons" in order. Even if the primary analysts believed that these tests were "much less appropriate," this doesn't mean they didn't measure what they purported to measure. There is no evidence that they didn't. It could also be that the primary researchers believed the basic skills tests were best suited for measuring what K-3 students are typically expected to know. In any event, the opinion that the tests were "much less appropriate" does not lead one to conclude that the tests didn't measure "complex 'cognitive, conceptual skills.'" This is an empirical question, and Kohn provides no empirical support for his conclusion.
Next, Kohn relies on the opinions of the "experts" commissioned and funded by the Ford Foundation (which I'll get to later) for the proposition that PFT only measured effects on "the mechanics of reading, writing, and arithmetic." Apparently, what these experts were getting at was that students who hadn't learned the mechanics of reading, writing, and doing arithmetic might not be able to demonstrate their cognitive skills. This is also an empirical question. But these experts provided no empirical support for their conclusion. Other researchers, however, did look into the question once it was raised.
Conceivably, certain models (let us say those that avowedly emphasize "cognitive" objectives) are doing a superior job of teaching the more cognitive aspects of reading and mathematics, but the effects are being obscured by the fact that performance on the appropriate subtests depends on mechanical proficiency as well as on higher-level cognitive capabilities. If so, these hidden effects might be revealed by using performance on the more "mechanical" subtests as covariates.
This we did. Model differences in Reading (comprehension) performance were examined, including Word Knowledge as a covariate. Differences in Mathematics Problem Solving were examined, including Mathematics Computation among the covariates. In both cases the analyses of covariance revealed no significant differences among models. This is not a surprising result, given the high correlation among Metropolitan subtests. Taking out the variance due to one subtest leaves little variance in another. Yet it was not a foregone conclusion that the results would be negative. If the models that proclaimed cognitive objectives actually achieved those objectives, it would be reasonable to expect those achievements to show up in our analyses.
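The covariance check described in that passage can be illustrated with a small synthetic sketch (invented scores, not the PFT data): when a "cognitive" subtest is largely a function of a "mechanical" one, partialling out the mechanical score leaves almost no model difference behind.

```python
import random

random.seed(0)

# Synthetic students from two hypothetical models (NOT the actual PFT data).
# Comprehension is assumed to track mechanics (the covariate) closely, mirroring
# the high correlation among Metropolitan subtests noted in the text.
students = []
for model in ("A", "B"):
    shift = 5.0 if model == "A" else 0.0  # model A is stronger on mechanics
    for _ in range(200):
        mechanics = random.gauss(50 + shift, 10)
        comprehension = 0.9 * mechanics + random.gauss(0, 3)
        students.append((model, mechanics, comprehension))

# Ordinary least squares slope of comprehension on the mechanics covariate.
n = len(students)
mx = sum(s[1] for s in students) / n
my = sum(s[2] for s in students) / n
slope = sum((s[1] - mx) * (s[2] - my) for s in students) / sum(
    (s[1] - mx) ** 2 for s in students
)

def mean(vals):
    return sum(vals) / len(vals)

# Raw model gap vs. the gap in covariate-adjusted (residual) scores.
raw_gap = mean([s[2] for s in students if s[0] == "A"]) - mean(
    [s[2] for s in students if s[0] == "B"]
)
resid_gap = mean([s[2] - slope * s[1] for s in students if s[0] == "A"]) - mean(
    [s[2] - slope * s[1] for s in students if s[0] == "B"]
)

print(f"raw comprehension gap:  {raw_gap:.2f}")
print(f"covariate-adjusted gap: {resid_gap:.2f}")  # shrinks toward zero
```

The point of the sketch is the one Bereiter and Kurland make: if a model's comprehension advantage flows entirely through mechanics, covariance adjustment will erase it, so a null result after adjustment was expected and uninformative given the subtest correlations.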
So, again we have no valid reason for discounting the results of the non-basic skills tests. Unsupported opinion is not a valid reason last I checked.
Lastly, Kohn raises a question:
This raises the interesting question of whether it is even possible to measure the conceptual understanding or cognitive sophistication of young children with a standardized test.
And then conspicuously fails to answer it. This is the poor man's version of debate. Moreover, nothing that precedes this "interesting question" is capable of actually raising it. Also, I'm not sure if Kohn is trying to claim that you can't measure these skills at all or that you can't measure them with a standardized test. Either way, Kohn provides no support for either claim.
Kohn's innuendo is that the students might have had conceptual understanding or cognitive sophistication that we must accept on faith, despite the evidence that more students in the non-DI models were incapable of demonstrating these magical immeasurable skills on simple tests of comprehension of written paragraphs, mathematical problem-solving, and knowledge of math principles and relationships, not to mention all the other "basic skills" that were measured.
Unfortunately, Kohn fails on all three counts to provide evidence that would compel a reader to follow him down his opinionated path that PFT only measured basic skills and that the other measures were "virtually worthless." Maybe this is why he buried this one in a footnote.
Next Kohn claims:
Some of the nontraditional educators involved in the study weren’t informed that their programs were going to end up being judged on this basis.
First of all, the DI educators were every bit as non-traditional as the other models' educators. DI is about as far removed from traditional pedagogy as the other models are.
Also, even if some of the other educators claimed that they were never initially told that their models were going to be judged on reading comprehension, math problem solving, and the like, they would have quickly learned what was coming down the pike, since the PFT students were extensively tested throughout the study. And it was the third and fourth cohorts that formed the basis of the evaluation. Whoever claims not to have known initially would certainly have found out during the time the first two cohorts passed through.
Next Kohn claims:
The Direct Instruction teachers methodically prepared their students to succeed on a skills test and, to some extent at least, it worked.
Actually, the DI model systematically prepared its students to read, to understand the conventions of language, to spell, and to do arithmetic, with an emphasis "placed on the children's learning intelligent behavior rather than specific pieces of information by rote memorization." And the students outperformed the other students on tests of sound-symbol relationships, vocabulary, word identification, math calculations, spelling, punctuation, capitalization, word usage, paragraph comprehension, mathematics problem-solving, and knowledge of math principles and relationships, as well as on the affective measures. There was no evidence that the DI students engaged in test preparation, as alluded to by Kohn.
PFT demonstrated, once again, that teaching these skills directly was more effective than teaching them obliquely, which is what the other models believed would lead to superior performance. It turns out they were wrong, and they continue to be wrong to this day.
For those keeping track at home, Kohn has now failed to establish the first two prongs of his argument. He has one more prong left which I'll take up in my next post.