June 7, 2006

Not So Stinky Research Part II

(This is part II of this post. Part I can be found here.)

In the last post, we discussed how we'd need to increase student performance by at least a full standard deviation in order to comply with NCLB. Bear in mind that this is considered a large effect size, and improvements of that magnitude are very rare in education research.

So are there any current educational interventions that are capable of improving performance by such a large amount?

Let's limit the discussion to research that has at least a moderate effect size (> 0.5 SD), statistically significant results, and at least three well-designed studies.
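For concreteness, "effect size" in these reports means the standardized mean difference (Cohen's d): the gap between the treatment and comparison group means, divided by a pooled standard deviation. Here's a minimal sketch of that calculation; the function name and test scores are made up for illustration and don't come from any of the studies discussed:

```python
# Illustrative only: effect size as the standardized mean difference
# (Cohen's d), using a pooled standard deviation. Scores are invented.

def cohens_d(treatment, control):
    """Standardized mean difference between two groups of scores."""
    def mean(xs):
        return sum(xs) / len(xs)

    def var(xs):  # unbiased sample variance
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    n1, n2 = len(treatment), len(control)
    pooled_sd = (((n1 - 1) * var(treatment) + (n2 - 1) * var(control))
                 / (n1 + n2 - 2)) ** 0.5
    return (mean(treatment) - mean(control)) / pooled_sd

# Hypothetical test scores for a treated and a comparison classroom
treated = [78, 82, 85, 90, 74, 88]
comparison = [76, 80, 78, 86, 73, 82]
print(round(cohens_d(treated, comparison), 2))  # roughly 0.68
```

With these made-up scores the treated classroom comes out about 0.68 pooled standard deviations ahead, which happens to sit right in the range the CSRQ report attributes to the interventions below.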

Believe it or not, there are a few interventions that meet these criteria.

Fortunately, the American Institutes for Research has already done much of the hard work for us by evaluating the research for many popular elementary school interventions. AIR published the results in its November 2005 paper CSRQ Center Report on Elementary School Comprehensive School Reform Models. I went through the report and found three interventions that met my criteria.


Accelerated Schools

From the CSRQ Center Report:
The CSRQ Center reviewed 37 quantitative studies for effects of Accelerated Schools on student achievement. Three studies met CSRQ Center standards for rigor of research design. The Center considers the findings of these three studies conclusive, which means that the Center has confidence in the results of these studies. About one third of the results reported in these studies demonstrated a positive impact of Accelerated Schools on student achievement, and the average effect size for these significant results is +0.76.
The AS studies can be found in Appendix A of the report.


Success For All

From the CSRQ Center Report:
The CSRQ Center reviewed 115 quantitative studies for effects of SFA on student achievement. Thirty-three of these studies met CSRQ Center standards for rigor of research design. Upon review, the Center considers the findings of 31 of these studies conclusive, meaning the Center has confidence in the results reported. The findings of the other 2 studies are considered suggestive, which means the Center has limited confidence in them. Overall, the 33 studies report a mix of results showing positive effects and no effect of SFA; of the 91 separate achievement test findings across the 33 studies, just over half (52%) demonstrate a statistically significant positive impact. The average effect size of these positive effects is +0.63.
The SFA studies can be found in Appendix U of the report.


Direct Instruction

From the CSRQ Center Report:
The CSRQ Center reviewed 56 quantitative studies for effects of DI on student achievement. Twelve studies met CSRQ Center standards for rigor of research design. The Center considers the findings of 10 of these studies conclusive, which means that the Center has confidence in the results reported. The findings of two studies are considered suggestive, which means the Center has limited confidence in the results. The findings in the conclusive and suggestive studies showed mixed results: some studies demonstrated a positive impact of DI on student achievement and other studies showed no significant effects. About 58% of the findings reported in the studies that met standards demonstrated positive effects; the average effect size of those significant findings was +0.69.
The DI studies can be found in Appendix K of the report.

I also know that Gary Adams performed a meta-analysis on the DI research and examined 34 rigorous studies. Here's what he has written about the results.
On pages 48 and 51, the meta-analysis shows that 17 studies lasted less than a year and 17 lasted over a year. The effect size can be calculated per comparison and per study but all of the results show large effect sizes: .95 for studies less than a year and .78 for studies more than a year... On page 44, the age of the publications was analyzed (1972–1980: 6 studies, 1981–1990: 22 studies, 1991–1996: 6 studies) and all of the effect sizes were large (.73, .87, 1.00, respectively).
Fifteen of the studies were conducted by researchers who have been somehow connected with Direct Instruction. In contrast, the majority of the studies (18 studies) were conducted by non-DI-connected researchers. The effect size for studies by DI-connected researchers was .99—a large effect size. The effect size for studies by non-DI-connected researchers was .76—also a large effect size.
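To make the pooling in Adams's analysis concrete, here is a sketch of how per-study effect sizes might be combined into an overall average, both unweighted ("per study") and weighted by sample size. The d values and sample sizes below are invented for illustration; they are not Adams's actual data:

```python
# Illustrative only: pooling per-study effect sizes in a meta-analysis.
# These d values and sample sizes are invented, not Adams's actual data.

studies = [
    # (effect size d, total sample size n)
    (0.95, 60),
    (0.78, 120),
    (1.10, 45),
    (0.57, 200),
]

# Unweighted mean: every study counts equally ("per study")
unweighted = sum(d for d, _ in studies) / len(studies)

# Sample-size-weighted mean: larger studies count more
weighted = sum(d * n for d, n in studies) / sum(n for _, n in studies)

print(f"unweighted: {unweighted:.2f}, weighted: {weighted:.2f}")
```

The choice of what to average over (studies, comparisons, or subjects) is exactly why a meta-analysis can legitimately report somewhat different numbers "per comparison" and "per study," as in the excerpt above.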

So, there are at least three educational interventions that are capable of achieving moderate to large effect size improvements on student performance, at least in the elementary grades.

Of course, all three of these interventions completely overhaul how schools are run, changing almost every aspect of the school. Maybe such radical change is necessary.

Does anyone else know of any other valid educational research with similar moderate to large effect sizes? The comments are open.

13 comments:

Ed Researcher said...

This set of posts is getting worse and worse because it keeps perpetuating the myth that effect size is a unit-free measure that does not depend on:
a) the reference population
b) the nature of the contrast

If you compare Curriculum A vs. Curriculum B and they cost the same, but use different philosophies (e.g. whole language vs. phonics; "traditional" vs. "new" math; etc.), then any statistically significant difference, no matter how small the effect size, is meaningful.

If you compare a new and costly intervention versus the status quo, then "statistically significant effect size of 0.25 sd" is not enough information for policy makers.

First (admittedly not a major issue in most cases), you have to know whether that is a standard deviation compared to a nationally normed sample or compared to a similar population of, say, disadvantaged students. An effect size of 0.25 in a subsample (e.g., a small state or minority students) is much smaller than an effect size of 0.25 in a diverse national sample.

Second, you have to compare the effect size to that of another intervention that uses the money in some other way.

People often (over-)use class size reduction (using the TN STAR experiment) as their benchmark.

KDeRosa said...

it keeps perpetuating the myth that effect size is a unit-free measure that does not depend on ... the reference population [and] the nature of the contrast

No, I indicated that the research is only valid for the conditions under which the study was conducted. You can't necessarily extrapolate.

If you compare Curriculum A vs. Curriculum B and they cost the same, but use different philosophies (e.g. whole language vs. phonics; "traditional" vs. "new" math; etc.), then any statistically significant difference, no matter how small the effect size, is meaningful.

I disagree. Implementing a new program with a small effect size usually results in a real-world effect size of zero. That's why we have the educationally significant cut-off of 1/4 sd. Chasing smaller effect sizes isn't worth the effort no matter what the cost.

Once we get over the 1/4 sd hump, then your points are valid. All things being equal, you'd rather choose the intervention that costs less per unit of increase. But interventions having effect sizes significantly greater than 1/4 sd are few and far between. We're in a "get what you can get" position.

People often (over-)use class size reduction (using the TN STAR experiment) as their benchmark.

STAR is a double failure. Low effect size and high cost.

SteveH said...

Affluent parents get to do their own research based on a sample size of one. The rest have to wait for someone else to do some valid research that will have a significant effect for all.

My point is that as a parent, "I don't need no stinking research." I don't want to wait until someone else decides (for me) what constitutes a good basic education.

Don't get me wrong, Ken. Reliable educational research is very important in determining what works best. But, "best" for what? Relative improvement for all, or best for each individual child, right now?

Actually, I have no issue unless "research" (good or bad) is being used to tell me what to do. My problem is that "research" is used to avoid the fundamental issues of educational assumptions and parental choice. A teaching methodology could have a statistically significant effect, but still not lead to algebra in 8th grade.

JohnL said...

KDeRosa, thanks for alerting me to this report. Yes, there are other interventions that produce large effects, as I've noted on Teach Effectively!, but these are not comprehensive school reform models.

I'd rather see an ES greater than .25 as the threshold. Researchers should probably pay attention when there's an average (over dozens of studies) ES of .25, but for practice I'd like to see something double that standard.

The Accelerated Schools ES is a little suspect to me, given that it's based on only 3 studies.

ed researcher, I agree that cost is an important factor in the reform equation. And, given the choice between a reform model that costs X and produces a negligible effect and one that costs 2X and produces a .5 effect, I'll go for the latter. In most cases, X doesn't equal $0. Communities are already spending lots on education, so adding $200K over three years is a small percentage increase, no?

Tracy W said...

My point is that as a parent, "I don't need no stinking research."

The downside being that if we (as a society as a whole) don't do any research, we don't know what works.

Medicine spent thousands of years doing things that often, once properly tested, turned out to make patients worse.

I don't want to wait until someone else decides (for me) what constitutes a good basic education.

I think it would be useful to make a distinction between two different things here.

One is what kids should be taught.

The other is how kids should be taught.

Research has a lot to do with the "how", but is much more limited in what it can say about the "what".

So whether schools should be aiming at algebra in the 8th grade is a value judgement sort of thing, and research can only help by things like indicating whether or not kids' brains are capable of learning algebra in the 8th grade.

But assuming that society has settled on a goal of algebra in the 8th grade, research can tell us a lot of useful stuff about how to get most kids from 1st grade to passing algebra in 8th grade.

SteveH said...

"... and research can only help by things like indicating whether or not kids' brains are capable of learning algebra in the 8th grade."

What can research tell us when the answer to this question is obviously yes? What percent of kids? Does it matter? This is my point. Assumptions. Is research going to be used to give kids the best lowest-common-denominator education for all, or the best educational opportunity for each individual child? My comments have absolutely nothing to do with whether research should or should not be done.

The problem is that some people think that proper research is now the answer to all sorts of educational questions. Rather than discuss assumptions and expectations, some feel that all we need are proper research-based educational programs. This is about control, not research. I'm all for proper research, but not if it is used to continue to keep parents out of the loop. Proper research is no substitute for choice.

As for my own personal opinion, I think that most educational problems can be solved by common sense, not research. Define a proper curriculum, set very clear year-to-year standards, and expect both the teachers and kids to get down to work - not play learning.

Ed Researcher said...

Hmm. Ken missed my point entirely. Maybe I didn't explain it right.

Not all policy choices are between an innovation and the status quo. Sometimes you have cost-neutral choices and you have to pick one.

For example, you're starting a charter school, brand new curriculum, and you have to pick a textbook.

Example #2, you're trying to hire a teacher and you have interviewed two candidates, like them equally well and they are identical in all respects except one has an advantage in years of experience and the other has an advantage in education or credentials.

Example #3, you won an unrestricted grant for your school or district and you want to use it for reducing class size, buying a whole school reform package, or teacher training.

In all 3 cases you would like some education research to tell you whether one alternative is better than the others. You don't care about the magnitude of difference, you just want a tie-breaker. The 0.25 sd rule is not relevant.

I would argue that policymakers face these kinds of decisions all the time and we ed researchers have to know when to abandon arbitrary rules about effect sizes.

There are other cases where effect size rules are not useful. I'll save that for another comment or post.

KDeRosa said...

ed researcher, your examples are starting to shed some light now, but I still disagree with your underlying premise.

In education we have three main types of programs:

1. Stuff that works well and which usually has large effect sizes associated with its research.

2. Stuff that performs as well as the status quo. This stuff usually has small but positive effect sizes.

3. Stuff that doesn't work well. Typically, this is the stuff with no or little research base. This stuff can linger on if it comports with the Ed fads.

What you're saying is that it's OK to pick ed programs that fall into group 2 if those are the only programs being considered.

What I'm saying is that if there are group 1 programs available, group 2 programs shouldn't even be considered. If we had such a thing as educational malpractice, picking a group 2 program would be per se negligence.

All things being equal, if a school is only selecting programs in group 2, it really doesn't matter which one they pick or what small effect sizes exist between them, in the real world student performance will be about the same or largely determined by extrinsic factors.

SteveH said...

Relative error is usually smaller than absolute error. It's easier to select between choice A versus choice B, but that says nothing about absolute accuracy or whether both choice A and B are bad. One can always limit research to these sorts of choices, but the results, however accurate, may be worthless once you remove the limitations or assumptions.

"For example, you're starting a charter school, brand new curriculum, and you have to pick a textbook."

Although I would be hard pressed to call this research, the results are all relative.


"Example #2, you're trying to hire a teacher and you have interviewed two candidates, like them equally well and they are identical in all respects except one has an advantage in years of experience and the other has an advantage in education or credentials."

This sounds more like assumptions, common sense, and judgment to me, rather than research - the research only being used to justify a value judgment. Imagine writing down an equation defining the merit function for this decision. What are your weighting factors? Whatever will give you the decision you want - all quite scientifically done.

"Example #3, you won an unrestricted grant for your school or district and you want to use it for reducing class size, buying a whole school reform package, or teacher training."

Choices A versus B versus C. All relative.

Research can be done scientifically but still not be worth the powder to ... well, you know. I don't want schools holding up "scientific" research as proof to do whatever they want.

"You don't care about the magnitude of difference, you just want a tie-breaker."

Maybe schools don't care about external or absolute magnitudes, but, as a parent, I do.

Catherine Johnson said...

Tracy

Medicine spent thousands of years doing things that often, once properly tested, turned out to make patients worse.


Actually, I think it's fair to say that medical research is quite weak. Every med student is taught that something like....is it 50%? of all studies published in JAMA in any given year are wrong.

Medicine is frequently said to be an art by "insiders" like me who, unfortunately, spend a huge amount of time with doctors. Doctors say the same thing.

Now that I know the name for Bayesian reasoning, I realize that's what I've always looked for in a physician; I look for a person who has "figured it out" through brains & years of experience.

Basically, I'm looking for the Alan Greenspan of medicine. (We've found him, btw. Eric Hollander. The Alan Greenspan of OCD.)

I think I tried to find the WSJ op-ed talking about Alan Greenspan's brain for you once, didn't I?

The new thing in medicine is exactly what we've seen happening for YEARS in education, which is the tyranny of frequentism.....doctors are now supposed to do "evidence-based treatment." Period.

If a doctor's Bayesian brain tells him to do X instead of Y, and Y is what JAMA has been promoting for the past few years, he's supposed to ignore his experience and wisdom and defer to peer-reviewed papers.

From where I sit, that's very bad. Our family has already suffered from medications prescribed on the basis of peer-reviewed research instead of physician experience.

I'm in favor of research.....but I'm not clear on the limits of frequentist research, and I don't understand Bayesian, case history, or field research well enough to see which kinds of research we should trust to tell us what.

Catherine Johnson said...

Skimming this thread, I'm inclined to feel that, yet again, the ed schools are the problem.

Common sense simply isn't a stock in trade for ed schools. Ideology, politics, social justice - anything but common sense.

Medicine is radically different.

In medicine, the doctor and the patient have one goal and one goal only: make the patient better.

That's an oversimplification, of course. Doctors and patients can have conflicts over values and goals, too.

Doctors also have treatment paradigms that can outlast their usefulness by years.

Still, doctors are focused on the patient and what happens to the patient - and they treat patients as individuals, not as classes.

Individuals are constantly having wonky reactions to drugs the peer-reviewed literature says work great.

So doctors are continually seeing the "exceptions to the rule."

Doctors are probably almost forced to become Bayesians, if they weren't already going in.

Catherine Johnson said...

Hey Ken!

I have the whole Distar Arithmetic series to look at today!

Tracy W said...

Sorry Catherine, realised I hadn't made a reply to this point.

Medicine spent thousands of years doing things that often, once properly tested, turned out to make patients worse.

Actually, I think it's fair to say that medical research is quite weak. Every med student is taught that something like....is it 50%? of all studies published in JAMA in any given year are wrong.


How do they know that 50% of all published studies are wrong?

Either the 50% figure is a made-up number, or it is the result of research itself.

I suspect the 50% figure arose from research, and refers to initial results that are reported in medical journals.

From where I sit, that's very bad. Our family has already suffered from medications prescribed on the basis of peer-reviewed research instead of physician experience.

I've suffered from medications not being prescribed in the first place due to physician experience being wrong.

Individuals are constantly having wonky reactions to drugs the peer-reviewed literature say works great.

Medicine can only work with averages.

And not all the wonky reactions are the fault of research. My dad had a bad reaction to some medication and then got a call from the pharmacist saying he'd just realised he'd given Dad the wrong medicine.