Garden-variety selection bias is bad enough in experimental design, but many of the studies suffer from design features that add to concerns about selection bias. In particular, many of the curriculum evaluations use a post-hoc design, in which a group of schools using a given program, perhaps for many years, is compared after the fact to schools that matched the experimental program at pretest or that matched on other variables, such as poverty or reading measures. The problem is that only the “survivors” are included in the study. Schools that bought the materials, received the training, but abandoned the program before the study took place are not in the final sample, which is therefore limited to more capable schools.
As one example of this, Waite (2000), in an evaluation of Everyday Mathematics, described how 17 schools in a Texas city originally received materials and training. Only 7 were still implementing it at the end of the year, and 6 of these agreed to be in the evaluation. We are not told why the other schools dropped out, but it is possible that the staffs of the remaining 6 schools may have been more capable or motivated than those that dropped the program. The comparison group within the same city was likely composed of the full range of more and less capable school staffs, and they presumably had the same opportunity to implement Everyday Mathematics but chose not to do so.
Other post-hoc studies, especially those with multi-year implementations, must have also had some number of dropouts, but typically do not report how many schools there were at first and how many dropped out. There are many reasons schools may have dropped out, but it seems likely that any school staff able to implement any innovative program for several years is a more capable, more reform-oriented, or better-led staff than those unable to do so, or (even worse) than those that abandoned the program because it was not working.
As an analogy, imagine an evaluation of a diet regimen that studied only people who kept up the diet for a year. There are many reasons a person might abandon a diet, but chief among them is that it is not working, so looking only at the non-dropouts would bias such a study.
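The survivor problem can be illustrated with a minimal simulation sketch in Python. It is not drawn from any of the studies discussed here. It assumes a hypothetical "capability" score for each school staff that raises test-score gains and also determines whether a school keeps the program, and it assumes the program itself has zero true effect. The function name `simulate_survivor_bias`, the `dropout_cutoff` parameter, and the 0.5 weight are all illustrative assumptions.

```python
import random
import statistics


def simulate_survivor_bias(n_schools=2000, dropout_cutoff=0.0, seed=0):
    """Compare "surviving" program schools with a comparison group.

    Assumption: the program has zero true effect, so any difference
    in gains comes only from which schools remain in the sample.
    """
    rng = random.Random(seed)

    def draw_school():
        # Hypothetical staff capability; it raises gains with or without the program.
        capability = rng.gauss(0, 1)
        gain = 0.5 * capability + rng.gauss(0, 1)
        return capability, gain

    adopters = [draw_school() for _ in range(n_schools)]
    comparison = [draw_school() for _ in range(n_schools)]

    # Assumption: staffs at or below the cutoff abandon the program before the study.
    survivor_gains = [gain for capability, gain in adopters if capability > dropout_cutoff]
    adopter_gains = [gain for _, gain in adopters]
    comparison_gains = [gain for _, gain in comparison]

    comparison_mean = statistics.mean(comparison_gains)
    return {
        "n_survivors": len(survivor_gains),
        "survivors_minus_comparison": statistics.mean(survivor_gains) - comparison_mean,
        "all_adopters_minus_comparison": statistics.mean(adopter_gains) - comparison_mean,
    }


result = simulate_survivor_bias()
print(result)
```

Under these assumptions, the survivors outperform the comparison group even though the program does nothing, while the full set of original adopters shows a difference near zero. The size of the gap depends entirely on the assumed link between capability and dropping out, which real studies rarely report.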
Worst of all, post-hoc studies usually report outcome data selected from many potential experimental and comparison groups, and may therefore report on especially successful schools using the program or matched schools that happen to have made particularly small gains, making an experimental group look better by comparison. The fact that researchers in post-hoc studies often have pre- and posttest data readily available on hundreds of potential matches, and may deliberately or inadvertently select the schools that show the program to best effect, means that readers must take results from after-the-fact comparisons with a grain of salt.
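A similar sketch shows why choosing comparison schools after the outcomes are known is so risky. It assumes, purely for illustration, that program schools and a pool of 300 potential comparison schools draw their gains from the same distribution (so the program has no effect), and it contrasts a random choice of comparison schools with the extreme case of picking the pool schools with the smallest gains. The function `simulate_selective_matching` and all its parameters are hypothetical.

```python
import random
import statistics


def simulate_selective_matching(n_program=20, n_pool=300, n_selected=20, seed=1):
    """Contrast random and outcome-based selection of comparison schools.

    Assumption: every school's gain comes from the same distribution,
    i.e. the program has zero true effect.
    """
    rng = random.Random(seed)
    program_gains = [rng.gauss(0, 1) for _ in range(n_program)]
    pool_gains = [rng.gauss(0, 1) for _ in range(n_pool)]

    # Honest choice: comparison schools drawn without looking at outcomes.
    random_matches = rng.sample(pool_gains, n_selected)
    # Worst case: keep the pool schools that happened to gain the least.
    lowest_gain_matches = sorted(pool_gains)[:n_selected]

    program_mean = statistics.mean(program_gains)
    return {
        "random_difference": program_mean - statistics.mean(random_matches),
        "selected_difference": program_mean - statistics.mean(lowest_gain_matches),
    }


print(simulate_selective_matching())
```

Deliberate selection on outcomes is the extreme case; inadvertent selection would produce a smaller but similarly directed distortion. The point of the sketch is only that a large pool of candidate matches makes a favorable comparison easy to find.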
Finally, because post-hoc studies can be very easy and inexpensive to do, and are usually contracted for by publishers rather than supported by research grants or done as dissertations, such studies are likely to be particularly subject to the “file drawer” problem. That is, post-hoc studies that fail to find expected positive effects are likely to be quietly abandoned, whereas studies supported by grants or produced as dissertations will almost always result in a report of some kind. The file drawer problem has been extensively described in research on meta-analyses and other quantitative syntheses (see, for example, Cooper, 1998), and it is a problem in all research reviews, but it is much more of a problem with post-hoc studies.
Of the 87 studies of elementary math programs that Slavin found to be sufficiently scientific, 38% used post-hoc designs. Notably, all of the "positive" research on the NSF-funded constructivist math programs came from post-hoc designs. Moreover, in almost all cases, these post-hoc studies yielded educationally insignificant effect sizes, i.e., less than 0.25 standard deviations. The same is true for most of the major math textbook curricula.
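For readers unfamiliar with effect sizes, here is a minimal sketch of one common way to compute a standardized mean difference and compare it with the 0.25 threshold mentioned above. It assumes the pooled-standard-deviation form (Cohen's d); the review itself may compute effect sizes differently, for example from adjusted posttest scores. The function names and the sample scores are hypothetical.

```python
import math
import statistics


def standardized_mean_difference(treatment, control):
    """Difference in means divided by the pooled standard deviation (Cohen's d)."""
    n_t, n_c = len(treatment), len(control)
    var_t = statistics.variance(treatment)  # sample variance
    var_c = statistics.variance(control)
    pooled_sd = math.sqrt(((n_t - 1) * var_t + (n_c - 1) * var_c) / (n_t + n_c - 2))
    return (statistics.mean(treatment) - statistics.mean(control)) / pooled_sd


def is_educationally_significant(effect_size, threshold=0.25):
    """Apply the 0.25 SD cutoff; the threshold is taken from the text above."""
    return effect_size >= threshold


# Hypothetical scores, chosen only to make the arithmetic easy to follow.
program_scores = [2, 4, 6]
comparison_scores = [1, 3, 5]
d = standardized_mean_difference(program_scores, comparison_scores)
print(d, is_educationally_significant(d))
```

An effect size is unit-free, which is what allows results from different tests and grade levels to be compared against a single cutoff such as 0.25.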