D-Ed Reckoning: Education Research

November 15, 2006

Education Research

Recently I criticized this post by Stephen Downes of Half an Hour which claimed:

Certainly, any approach to learning theory that suggests that an experiment can be conducted in (say) a double-blind model in order to test hypotheses in terms of (say) achievement of learning outcomes in my view demonstrates a fundamental misunderstanding of the nature of the enquiry.

Stephen stopped by in the comments and defended his position

My arguments are not without foundation and evidence. What I am criticizing (as one who knows the field understands) is a particular approach to testing and evidence that has been subject to widespread criticism both inside and outside the sciences.

and suggested that we drop by his site and see what he had to say. So, I did. I found this.

That education is a complex phenomenon, and therefore resistant to static-variable experimental studies, does not mean that it is beyond the realm of scientific study. It does mean, however, that the desire for simple empirically supported conclusions (such as, say, "experiments show phonics is more effective") is misplaced. No such conclusions are forthcoming, or more accurately, any such conclusion is the result of experimental design, and not descriptive of the state of nature.

I'm not convinced. Education research is execrable most of the time. But, legitimate education research does exist which permits general conclusions to be drawn. Often, in education research, we are content if we learn whether an intervention increases student performance. We don't necessarily need to know why the intervention works.

For example, let's put together a hypothetical education experiment for a reading intervention for grades K-2. Let's call it the RITE program. The study will consist of 4000 students in the intervention group and 4000 students in the control group spread over many classrooms with many teachers. We will make sure demographic factors and other student factors are taken into account when splitting up the groups. The control group will be given a "research based, phonics reading program." The measurement tool we'll use is the SAT-9, which appears to be a good measure of reading ability. All students will receive pre-tests and post-tests to measure student achievement. For good measure we'll have an external evaluator conduct the study to reduce the bias effect. Here are the results of our study:

The first bar is for the control group: 36% performed better than the 50th percentile while 33% performed below the 25th percentile. The next bar is for the intervention group who've gotten one year of the intervention. This group performs about the same as the control group. The next bar is for the intervention group in the program for two years; performance is starting to improve, but the students are not quite up to national norms yet. The last bar shows the intervention group who've been in the program for all three years. This group performs above the national norm -- 61% of students are performing above the 50th percentile, while only 14% are performing below the 25th percentile.

The effect size of the three year intervention group is over a standard deviation, which is a large effect size for educational interventions and practically unheard of in education. Due to our large sample size, our results are statistically significant at a high level and we can confidently achieve them by faithfully replicating the intervention. We don't need to know whether the intervention group used a whole language program or a phonics based program, whether the students were exposed to rich literature, or any other messy detail. Such details, though important, are irrelevant to us. So are other external factors, because whatever factors affected the intervention group also affected the control group.
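For readers unfamiliar with the term, an effect size of this kind is usually reported as a standardized mean difference (Cohen's d): the gap between the group means divided by a pooled standard deviation. The sketch below is only an illustration of that arithmetic. The function name and the score lists are hypothetical and are not taken from the RITE report, and I am assuming the report's effect size was computed along these general lines.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(treatment, control):
    """Standardized mean difference using a pooled sample standard deviation."""
    n_t, n_c = len(treatment), len(control)
    s_t, s_c = stdev(treatment), stdev(control)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_sd = sqrt(((n_t - 1) * s_t**2 + (n_c - 1) * s_c**2) / (n_t + n_c - 2))
    return (mean(treatment) - mean(control)) / pooled_sd


# Hypothetical post-test scores, invented purely for illustration.
control_scores = [40, 45, 50, 55, 60]
treatment_scores = [50, 55, 60, 65, 70]

d = cohens_d(treatment_scores, control_scores)
print(f"effect size d = {d:.2f}")
```

A d above 1.0 means the average student in the intervention group scored more than one control-group standard deviation above the average control student.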

And, while it is possible that other interventions might work as well or better than this one, we know with a high degree of certainty that this intervention will significantly boost student achievement.

By the way, the study is real. RITE stands for the Rodeo Institute for Teacher Excellence (Houston). The evaluator was the Texas Institute for Measurement, Evaluation, and Statistics. The report was published in 2002. See more about it here (and here).

If we had more classroom research like this and if schools adopted successful research-based interventions with fidelity instead of trying to extrapolate out the parts they think are the cause for success (invariably they're wrong), student achievement could be markedly improved in this country.

7 comments:

Stephen Downes said...: Without launching into a big long discussion (again, this is old ground for people familiar with empirical research), there are many reasons to question the conclusions offered by such studies.

No single study should ever be accepted as proving one or another hypothesis, no matter how large the sample size. The essence of empirical science is that the phenomena it describes must be replicable, which means they must actually be replicated. It is not enough to say that the odds of getting a different result are very small. The replication, by different scientists, in different circumstances, needs to be undertaken.

That is why I am a bit surprised that you would not at least attempt to demonstrate replicability of the theories tested in the RITE evaluations. You could certainly, for example, point to the Direct Instruction & Open Court Fact Sheet, which documents numerous studies supporting Direct Instruction. Of course, it also cites a number of studies supporting Open Court.

So what do you do when there are many studies, all apparently equally rigorous, but which support different theories? One tactic is to start calling other theorists names - that's what I see happening in the Engelmann paper, for example. Or this author, who writes, "if we could only get the morons (ed perfessers, district pinheads) out of education…"

But people who are serious about learning won't resort to this. They focus on what makes studies more likely to be accurate or misleading, and even more importantly, they understand the limits of such studies.

For example, the first question when we look at comparisons is, "better than what?" For example: "'There's evidence here that Direct Instruction is definitely helping some of the students,' said Ms. Mac Iver, an associate research scientist at Johns Hopkins University in Baltimore. 'The issue we still need to ferret out is whether they're doing significantly better than students getting other types of instruction.'"

That's a significant problem. There is no shortage of studies proving the effectiveness of this or that educational methodology. How can this be?

The problem is, as I noted in my earlier comment, that the studies are insufficient to prove that one form of instruction is uniquely the best. Even controlling for variables, the design of such studies is not able to encompass the many factors involved in their success or failure.

Let's take the RITE studies as an example.

In the RITE studies, schools chose the model they wanted to support, then received $750 to implement the model. This means that schools using Direct Instruction at least nominally supported direct instruction, and had some funds to support this implementation. Does Direct Instruction work in schools that don't have additional funds and where the staff are not motivated?

In reports of implementation of Direct Instruction, the results are uneven - in some schools (especially those with impoverished kids) the results are very good, but in other schools they are not so good. Even if overall Division scores have improved, can we conclude that RITE should be used for all schools, or only those with impoverished students?

In the RITE program, the test used to evaluate the students was the SAT-9. This test measures a certain competency in language learning. Does the study show us that this is an appropriate standard? Can the same improvements be detected in other tests? Is there a corresponding improvement in student college entrance exam essays?

Achievements in language learning can be measured in many ways. The measurement described here, "better than the 50th percentile [and] below the 25th percentile" is an odd sort of measurement to use in what ought to be an objective evaluation, since these are relative measures, and not indicative of any concrete accomplishment. Why would the examiners not simply report improvements in actual scores on tests?

Direct Instruction required adherence to a very specific and orchestrated type of instruction. Is it possible that this sort of instruction works better in certain U.S. communities than others? Does the Texas test tell us it would work in an environment like a Canadian school, where students are much less likely to take orders?

The test results cited only covered results over a three year period. Can we know, from this test, whether language learning continues to improve at this pace? Is it shown that Direct Instruction should (or should not) be used at higher grade levels? Does the RITE evaluation tell us whether success at lower levels using Direct Instruction translates to success at higher levels, or whether it translates to poorer results at higher levels?

Does the RITE evaluation reveal to us whether there are any negative effects from Direct Instruction? For example, both mathematics and grammar require formalism, and Direct Instruction de-emphasizes formalism. Does the RITE evaluation tell us whether there are therefore any effects on math scores, either right away or in the long term?

Does the RITE evaluation show us that Direct Instruction works for every single student? Does it actually harm some students? Should it be used anyways? If not, what should be used? Will the use of this alternative impact the deployment of RITE?

How much does Direct Instruction cost? I saw a figure somewhere saying it takes $60K to implement fully. Is Direct Instruction the most effective way to spend 60K in a school? Are there other interventions - such as the provision of free hot lunches for students, which have been shown in numerous studies to far outweigh any pedagogical intervention? Does the RITE evaluation tell us about the relative merits of such interventions?

In some cases, students are not subjected to a school program at all (the clearest example being the Montessori program). Does the RITE evaluation show better performance on overall school achievement than the Montessori program?

I could continue with my questions, but I think the implications are clear:

- one single test, such as the RITE evaluation described here, leaves far too many questions unanswered

- we could conduct more tests, but what we find, empirically, is that different tests, under different conditions, produce different results

- there is no set of independent, empirically measurable, criteria to tell us which test to use. The tests assume the conclusion they are trying to prove - they do this in the way terms are defined and in the way variables are measured

- tests attempt to control for variables, but in environments of mutually dependent variables, controlling one variable actually changes the result of all the other variables

- in particular, the tests define a certain domain of acceptable alternatives - anything outside this domain cannot be contemplated. But better alternatives may exist outside the domain.

Now I don't like this any more than you do. I would love to be able to recommend phonics (I was actually schooled using phonics) or Direct Instruction or whole language or whatever to schools and teachers, if I knew it would work. But the more I look at this field, the more I understand that this knowledge is not forthcoming, not because we can't find out, but because there is no such fact of the matter.

Asking about the best way to teach language is like asking for the best letter of the alphabet. Trying to measure for the best way to teach language is like trying to find the warmest or the coldest letter.

Now, I don't expect that any of this has convinced you. It is difficult to abandon the idea of simplicity and certainty in science. But complex phenomena are real, just like the weather, and learning is one of them, which means that any simple cause and effect theory will be, by that fact, wrong.; November 16, 2006 9:20 PM
Stephen Downes said...: Correction (stupid Google Blogger comment writing Window).

The link in this came out incorrectly: Or this author, who writes, "if we could only get the morons (ed perfessers, district pinheads) out of education…"

It should be: "Or this author, who writes, "if we could only get the morons (ed perfessers, district pinheads) out of education…""; November 16, 2006 9:28 PM
Anonymous said...: " ...the best way to teach language ..."

Who expects to find the "best way"? Standardized tests are so simple and the results are so bad that nobody needs the "best way".

As the educational bar gets lowered, so does the need for perfect research. Besides, many of the problems have to do with assumptions and educational philosophy, not valid research.

"But complex phenomena are real, just like the weather, and learning is one of them, which means that any simple cause and effect theory will be, by that fact, wrong."

You're making this research stuff way too important. How complex is the research needed to get kids to learn to tie their shoes by third grade? How about learn their adds and subtracts by third grade, or the times table by fifth grade?

Educational results are so bad that talk of perfect research changes the focus away from assumptions and competence. If USEFUL research is so difficult or complex, then let's just give the money back to the parents and let them decide.

In research there are two types of accuracy: absolute and relative. Many of the engineering simulations I have done (with huge numbers of variables) give poor absolute results, but very useful relative results.; November 16, 2006 10:38 PM
KDeRosa said...: Stephen, education is more like engineering than astrophysics.

We're not necessarily concerned with what is "best," we'll settle for what reliably works. It could be that ten different interventions work equally as well. This would be a good thing.

In engineering/education type research, we are willing to permit a certain amount of uncertainty since we're dealing with human lives, children no less. We settle for well designed social science experiments and quasi-experimental designs, knowing that our results will be qualified. So in the RITE study, fidelity of implementation no doubt influenced achievement. Some schools probably raised achievement by 1/2 sd while others increased achievement by 1.5 sd. On average, however, the study shows that we can raise achievement for kids like the ones in the study, with teachers like those in the study, with funding levels like those in the study, to name but a few of the uncertainties.

And, of course, the intervention used in the RITE study has been studied numerous other times (as you've pointed out) and, on average, we consistently see similar results, increasing the reliability of the results.

As far as the name calling goes, Engelmann and Kozloff are not criticizing those in education with equally rigorous studies, but rather those who take a completely unscientific approach, i.e., the ed perfessers.

This is why I find your position so curious: you are trying to hold social science research to an impossibly high standard, a standard that it is not necessary to meet to get acceptable results, in an apparent attempt to discredit all education research in order to place all the failed and/or unproven educational theories (at least by the lowered standards we permit in social science research) on the same footing as those that have been validated.

I'm supposing the reason for this insistence on false rigor is to excuse yourself from the need to prove (using even the lowered social science standards) the efficacy of the educational programs you're promoting or endorsing.

Am I wrong?; November 17, 2006 8:33 AM
Anonymous said...: I want to add that at the other end of the useful research spectrum are those studies that show any kind of (relative) improvement. Hence the studies that celebrate an improvement when schools use Everyday Math. It apparently doesn't bother them that the kids are still not prepared for a proper algebra course in 8th grade.

It's not that we need the best methods, or ones that just show any kind of improvement. We need methods that show results based on meeting externally defined (world-class) grade-level expectations. There may be many methods that meet these criteria. If the standards are low enough (think of state proficiency cutoff levels), then results have more to do with competence and accountability, than with best methods.

Unfortunately, many in the education world cannot bring themselves to define any specific sort of educational goal. The fuzzy variables grow and the usefulness of any research is questionable.; November 17, 2006 12:23 PM
Stephen Downes said...: Stephen, education is more like engineering than astrophysics. We're not necessarily concerned with what is "best," we'll settle for what reliably works. It could be that ten different interventions work equally as well. This would be a good thing.

Yes, it would be a good thing.

And I suppose I should not have hastily used the word 'best'. I, too, would settle for 'reliably works'. In my own defense, though, I would point out that it was the purpose of the article you cited to show that one particular system is 'best' - it's not something I just picked out of the air.

My point still holds: the study you cite does not prove that the method examined reliably works. The questions I posed in my previous comment still hold. And if the answers are wrong, the method could perform well in the evaluation and still be a disaster.

... in the RITE study, fidelity of implementation no doubt influenced achievement. Some schools probably raised achievement by 1/2 sd while others increased achievement by 1.5 sd. On average, however, the study shows that we can raise achievement for kids like the ones in the study, with teachers like those in the study, with funding levels like those in the study, to name but a few of the uncertainties.

No. You are saying there's a +- factor at work here, that there's some fuzziness around the edges. What I'm saying is that even with these results you might not even be in the same ballpark.

You are allowing the experimental results to convince you that the method improves outcomes. I am responding that it shows no such thing. It only appears to show improved outcomes, and this because you are viewing the results in a theoretical framework that does not allow any other possibility.

And, of course, the intervention used in the RITE study has been studied numerous other times (as you've pointed out) and, on average, we consistently see similar results, increasing the reliability of the results.

Right. And you won't. Nor either will any of the questions I posed ever be answered.

As far as the name calling goes, Engelmann and Kozloff are not criticizing those in education with equally rigorous studies, but rather those who take a completely unscientific approach, i.e., the ed perfessers.

Doesn't matter. Name-calling is what people do when they don't have an argument.

This is why I find your position so curious, you are trying to hold social science research to an impossibly high standard...

It's a high standard, yes, because society is filled with people whose lives have been ruined by botched educational theory. I would rather not see people's hopes and dreams wrecked before they are teenagers.

But it's not an 'impossibly high' standard. It's a different standard. We need to understand at the outset that we are not working with stars or bridges - we are working with things that are much more complex, so much more complex that analogies with these simple physical systems are really misleading.

The experimental construct of the RITE study (and the other studies) treats learning as though it were a relatively simple cause-effect system. Do A and reliably get B. But humans aren't like that - you can't even reliably get the same effect out of a single person, let alone different people.

...a standard that it is not necessary to meet to get acceptable results, in an apparent attempt to discredit all education research in order to place all the failed and/or unproven educational theories (at least by the lowered standards we permit in social science research) on the same footing as those that have been validated.

Um... huh?

When you write things like "... an apparent attempt..." you should recognize that you're going off the rails. If you don't know what I'm attempting to do, just ask. To presuppose that I'm trying to do one thing or another is to prejudice yourself without having considered what I am doing.

And what I am doing is this: I am pointing out that a type of research that is thought to be appropriate is not in fact appropriate. I am basing this on the fact that such research supports conclusions that, on balance, do not appear to be true.

I don't have to have a reason for doing this. I don't need to be motivated by a desire to promote something else. I don't even have to have the intention to "discredit" certain theories. I really don't care about that.

I'm supposing the reason for this insistence on false rigor is to excuse yourself from the need to prove (using even the lowered social science standards) the efficacy of the educational programs you're promoting or endorsing. Am I wrong?

I am not 'promoting' educational programs. I am not 'promoting' anything. So yes, you're wrong.

As you look through my site you'll find a lot of ideas and opinions, true. These are based on my work and my observations. I am happy to have any of them challenged, put to the test. It won't matter much to me if I'm wrong - I'm not in this to be right about everything but rather to play my role in the larger enquiry. Of course, I think I'm right. But I have no need (financial or otherwise) to 'discredit' or somehow respond to other theories inappropriately.

To me, when I say the study doesn't prove the conclusion, it's like saying 2 plus 2 doesn't equal 5. You can believe me or not, I don't really care, and I don't have a stake in what you go away thinking.

I don't know why you thought it would be appropriate to quote one paragraph out of the body of my work and highlight it for criticism. I'm glad that it led you to look at my work a little more deeply; that's the purpose of what I do, after all.

I think you have seen, as I asserted originally, that my work is not without foundation. I do have reasons for what I say, and sometimes what I say even has social and political implications. But my politics are based on my science; my science is not based on my politics.; November 17, 2006 9:14 PM
KDeRosa said...: My point still holds: the study you cite does not prove that the method examined reliably works. The questions I posed in my previous comment still hold. And if the answers are wrong, the method could perform well in the evaluation and still be a disaster.

We appear to be on the same page now, so let's see whether the questions still hold up.

That's a significant problem. There is no shortage of studies proving the effectiveness of this or that educational methodology. How can this be?

There most certainly is a shortage of high quality education studies. Most education studies do not even come close to meeting the generally accepted standards for social science research. Just because your position is that all education research is crap doesn't mean that the studies are all equally crappy. Once we throw out the true crap, we see much less variance in results.

But, since we are looking for what is effective as opposed to what is best, this point is moot.

Does Direct Instruction work in schools that don't have additional funds and where the staff are not motivated?

The results assume a certain funding level, so if you want to achieve the same results you'd have to fund your school in a similar manner and provide whatever motivation that $750 bought you, most likely a de minimis change.

I do not contend that the results of this study are universally applicable. The validity of the study is constrained to the parameters under which the study was performed.

Even if overall Division scores have improved, can we conclude that RITE should be used for all schools, or only those with impoverished students?

The study is only valid for the type of schools that were part of the study. Most were likely poor inner city schools, but the pretest results suggest that some probably were not so impoverished.

Does the study show us that this is an appropriate standard? Can the same improvements be detected in other tests? Is there a corresponding improvement in student college entrance exam essays?

As I'm sure you're aware, the SAT-9 is a well-respected measure of student language achievement comparable to the CTBS or ITBS. I am aware of no basis for believing this particular measure is invalid.

DI reading programs have been studied using other reading tests with similar results.

Since this study only went up to the second grade, instruction in grades 3-12 would be a confounding factor when trying to use a college entrance exam as a measure of achievement.

Why would the examiners not simply report improvements in actual scores on tests?

The graphs were a convenient summary of results. No doubt a better measure of results is available in the full report. And, the researcher should make the data set available to anyone who wants to look at it. That's standard operating procedure.

Is it possible that this sort of instruction works better in certain U.S. communities than others? Does the Texas test tell us it would work in an environment like a Canadian school, where students are much less likely to take orders?

This study is limited to students like those in Texas. It may not be applicable to students much different than those in the study. Though DI programs have been tested in many different schools and you can probably find a studied school similar to a school in Canada that may permit extrapolation of results.

Then there's a bunch of questions that are mostly conjecture on future possible effects, negative and positive, of the intervention. They're all legitimate concerns but they don't call into question the result of the study, which is 2nd grade language achievement. The questions imply further research.

Does the RITE evaluation show us that Direct Instruction works for every single student? Does it actually harm some students? Should it be used anyways? If not, what should be used? Will the use of this alternative impact the deployment of RITE?

Looks like it didn't work at all or only partially worked for at least about 39% of students who still perform below the mean.

How much does Direct Instruction cost? I saw a figure somewhere saying it takes $60K to implement fully. Is Direct Instruction the most effective way to spend 60K in a school?

$60k is chump change for most school districts. Most of that goes to initial training, I believe.

Are there other interventions - such as the provision of free hot lunches for students, which have been shown in numerous studies to far outweigh any pedagogical intervention?

Where were these studies conducted? The Congo? The weight of authority here is that such hot lunch provisioning does little to affect the super-nutritioned kids in the U.S.

In some cases, students are not subjected to a school program at all (the clearest example being the Montessori program). Does the RITE evaluation show better performance on overall school achievement than the Montessori program?

The Montessori program is a school program. To find out relative performance you'd have to find a legitimate study of K-2 Montessori students using a similar measure, with similar kids, with similar teachers, etc.

one single test, such as the RITE evaluation described here, leaves far too many questions unanswered

But no legitimate questions that call into question the efficacy of the study.

we could conduct more tests, but what we find, empirically, is that different tests, under different conditions, produce different results

Actually, the results of different tests under similar conditions appear to come out the same as the RITE study.

there is no set of independent, empirically measurable, criteria to tell us which test to use. The tests assume the conclusion they are trying to prove - they do this in the way terms are defined and in the way variables are measured

You need to explain this one better before I can critique it, much less buy it.

tests attempt to control for variables, but in environments of mutually dependent variables, controlling one variable actually changes the result of all the other variables

We know the variables in play here. Perhaps this language intervention consumes so much class time that it affects, say, math achievement. But, if you take this study together with all the rest of the DI research you see that this is not the case and that the DI intervention does not affect the variables we care about in measuring education outcomes.

in particular, the tests define a certain domain of acceptable alternatives - anything outside this domain cannot be contemplated. But better alternatives may exist outside the domain.

Indeed, they might. This is only relevant if we are concerned with finding what's best as opposed to finding what works.

What I'm saying is that even with these results you might not even be in the same ballpark.

That result is possible, but highly unlikely given the sample size, assuming, of course, that the research methodology was sound.

I am responding that it shows no such thing. It only appears to show improved outcomes, and this because you are viewing the results in a theoretical framework that does not allow any other possibility

The study could have just as easily shown significantly lower results, which is why we set the cutoff at 1/4 standard deviation effect size for accepting an intervention as being effective. In this case we are well above that level.

Doesn't matter. Name-calling is what people do when they don't have an argument.

You mean like this? "When you write things like '... an apparent attempt...' you should recognize that you're going off the rails."

It's a high standard, yes, because society is filled with people whose lives have been ruined by botched educational theory. I would rather not see people's hopes and dreams wrecked before they are teenagers.

You need to distinguish between the crap and the not crap. The botched theory comes from all the bad studies that do not even come close to meeting the standards of the RITE study. You are painting with way too broad a brush and your conclusions are not justified by the evidence at hand.

But it's not an 'impossibly high' standard. It's a different standard. We need to understand at the outset that we are not working with stars or bridges - we are working with things that are much more complex, so much more complex that analogies with these simple physical systems are really misleading.

So, what exactly is your standard?

You must not have ever built a bridge if you think a bridge is a simple system. Calling a bridge a simple physical mechanism is not even close to being accurate. In fact, the variables that must be contended with in bridge construction are at least as complicated and mutually dependent (or whatever other excuse you've brought up here) as those in education. Bridges don't fall down (much) because we follow the research, which is really not much more certain than education research.

The experimental construct of the RITE study (and the other studies) treats learning as though it were a relatively simple cause-effect system. Do A and reliably get B. But humans aren't like that - you can't even reliably get the same effect out of a single person, let alone different people.

And neither are weather, wind, traffic, materials, and all the other variables that come into play every day in bridge construction, and yet we perform reliable research on bridge construction all the time.

According to Engelmann, children act remarkably lawfully to given stimuli; they are not so quixotic as you might think. There is a range of expected behavior that we can expect and account for, just as there is a range of wind conditions and traffic conditions we can account for in bridge design.

I am pointing out that a type of research that is thought to be appropriate is not in fact appropriate. I am basing this on the fact that such research supports conclusions that, on balance, do not appear to be true.

Hmmm. You haven't managed to find one false thing in the RITE study. You're dancing all around the periphery, but you're not attacking the study directly. Given conditions x, y, and z, the study shows that the intervention has an effect size of n on subject b using measure m. What part of that is false?

But I have no need (financial or otherwise) to 'discredit' or somehow respond to other theories inappropriately.

Fair enough.

But my politics are based on my science; my science is not based on my politics.

I don't remember saying this. And if you're saying it about me, you should show some support for that conclusion.; November 18, 2006 12:15 AM