October 2016: Determining “Best Available Evidence”

Stockard, J., & Wood, T. W. (2016). The threshold and inclusive approaches to determining “best available evidence”: An empirical analysis. American Journal of Evaluation. Advance online publication. doi:10.1177/1098214016662338.

Summary by Dr. Jack Fletcher

The What Works Clearinghouse (WWC) is sponsored by the primary federal agency for educational research, the Institute of Education Sciences (IES). The WWC was designed to evaluate evidence supporting different types of educational practices, with a focus on evidence-based interventions. As the WWC states on its website, its purpose is to review “the existing research on different programs, products, practices, and policies in education. Our goal is to provide educators with the information they need to make evidence-based decisions. We focus on the results from high-quality research to answer the question ‘What works in education?’ To accomplish this goal, the WWC performs systematic reviews of the existing research literature on a variety of educational practices and posts reports on the effectiveness of these practices.”

The approach used by the WWC is to identify and summarize studies that meet a priori criteria for quality. The guiding principle is that a few high-quality studies provide more guidance on best practices than a synthesis of all possible studies, which may vary in their quality and execution. Stockard and Wood characterize the WWC methods for synthesizing and evaluating educational practices as a threshold approach in contrast to an inclusive approach, which includes as many relevant studies as possible. Although the latter approach also applies a priori inclusion criteria to the literature search, these criteria tend to be less restrictive. An inclusive approach assigns quality ratings to studies and uses these ratings to determine the influence of different aspects of study quality and design and how these factors moderate estimates of effectiveness.
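The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four studies and their numbers are hypothetical, the `Study` class and `pooled_effect` function are names invented here, and a simple fixed-effect inverse-variance average stands in for whatever synthesis model a real review would use. It is not the procedure of the WWC or of Stockard and Wood.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Study:
    effect: float          # standardized mean difference (hypothetical value)
    variance: float        # sampling variance of the effect (hypothetical value)
    meets_threshold: bool  # does the study pass every a priori quality criterion?


def pooled_effect(studies: List[Study]) -> Optional[float]:
    """Fixed-effect inverse-variance weighted mean; None when no study is available."""
    if not studies:
        return None
    weights = [1.0 / s.variance for s in studies]
    return sum(w * s.effect for w, s in zip(weights, studies)) / sum(weights)


# Hypothetical studies, invented purely for illustration.
studies = [
    Study(effect=0.50, variance=0.04, meets_threshold=True),
    Study(effect=0.30, variance=0.02, meets_threshold=False),
    Study(effect=0.60, variance=0.05, meets_threshold=False),
    Study(effect=0.40, variance=0.01, meets_threshold=False),
]

# Threshold approach: discard every study that fails the criteria, then summarize.
threshold_estimate = pooled_effect([s for s in studies if s.meets_threshold])

# Inclusive approach: keep all studies and treat quality as a moderator,
# here by comparing the pooled effect of passing and failing studies.
inclusive_estimate = pooled_effect(studies)
failing_estimate = pooled_effect([s for s in studies if not s.meets_threshold])

print(f"threshold: {threshold_estimate:.2f} from 1 study")
print(f"inclusive: {inclusive_estimate:.2f} from {len(studies)} studies")
print(f"moderator contrast (pass vs. fail): {threshold_estimate:.2f} vs. {failing_estimate:.2f}")
```

In this toy example the threshold estimate rests on a single study, while the inclusive estimate uses all four and still allows a check on whether study quality is related to the size of the effect.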

Stockard and Wood applaud the goals of the WWC, but express concerns about its use of a threshold approach, which often results in a small set of studies and small samples of participants for a determination of evidence-based practices. They note that most synthesis efforts in the behavioral sciences use an inclusive approach to empirical synthesis (meta-analysis). They also note that the WWC approach, which includes multiple levels of thresholding, was not an articulated methodology in the original policy documents, such as the 2002 National Research Council report, Scientific Research in Education (Shavelson & Towne, 2002). The author of this summary was a member of the committee that wrote this report, which provided a blueprint for the creation of IES and the WWC. The report implicitly supported an inclusive approach to empirical synthesis, citing many of the types of strategies also cited by Stockard and Wood. After the report and the creation of IES, the WWC began as a clearinghouse for summarizing educational research using an inclusive approach, but it shifted to the threshold approach as it evolved. Stockard and Wood are primarily concerned that different estimates of the effectiveness of educational interventions will emerge from syntheses using these distinct approaches.

To investigate, Stockard and Wood undertook a comparison of findings for 252 literacy interventions reviewed by the WWC, focusing on the proportion of available studies that met the WWC threshold criteria and the proportion that did not. They then compared the conclusions of the WWC report on the effectiveness of one literacy program, Reading Mastery, to the conclusions that would emerge using a more inclusive approach on all available studies (k = 131) that reported results from different experimental designs permitting comparisons of students exposed and not exposed to the program.

Understanding the Study Results

Study 1

In the first study, only 93 of the 252 reviewed educational programs had been subjected to evaluation studies that met the WWC threshold criteria. This means that nearly two thirds (159, or about 63%) of the interventions reviewed could not be rated by the WWC. Of the 4,098 available studies investigating the 93 programs, 11% met initial criteria and 4% were judged strong enough to be included in the final review. As a result of these stringent inclusion criteria, most of the WWC recommendations were based on one to two qualifying studies, typically involving fewer than 200 students per study. The median WWC evaluation of these 93 literacy programs rested on a single study including 146 students who were exposed or not exposed to the intervention.
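The arithmetic behind these figures can be checked with a short Python sketch. The counts of studies are approximations, because the summary reports only rounded percentages, and the sketch assumes that both percentages refer to the full pool of 4,098 studies, which the text does not state explicitly. The variable names are illustrative.

```python
# Figures reported in the summary of Study 1.
programs_reviewed = 252
programs_with_qualifying_studies = 93
available_studies = 4098

# Programs that could not be rated because no study met the threshold criteria.
programs_unrated = programs_reviewed - programs_with_qualifying_studies
share_unrated = programs_unrated / programs_reviewed

# Approximate study counts implied by the rounded percentages.
# Assumption: both percentages are taken over all 4,098 available studies.
approx_meeting_initial = round(0.11 * available_studies)
approx_in_final_review = round(0.04 * available_studies)

# Rough average number of qualifying studies per rated program.
studies_per_rated_program = approx_in_final_review / programs_with_qualifying_studies

print(f"unrated programs: {programs_unrated} ({share_unrated:.0%})")
print(f"studies meeting initial criteria: about {approx_meeting_initial}")
print(f"studies in final review: about {approx_in_final_review}")
print(f"qualifying studies per rated program: about {studies_per_rated_program:.1f}")
```

Under that assumption, the implied average of fewer than two qualifying studies per rated program is in line with the report that most recommendations rested on one to two studies.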

Study 2

The second study investigated whether an inclusive approach would generate a different estimate of effectiveness for a well-investigated program: Reading Mastery. The published WWC report indicates that Reading Mastery could not be evaluated because no studies met their criteria. Stockard and Wood confirmed that none of the 131 studies involving Reading Mastery met all the WWC criteria for inclusion, although 6 studies would with slight relaxations of criteria for publication date and differences on baseline performance measures.

In a second analysis, the researchers found minimal association between effectiveness estimates and the inclusion criteria used by the WWC. Effectiveness was indexed by effect sizes capturing the magnitude of differences between students exposed and not exposed to Reading Mastery. These estimates, which ranged from .39 to .79 when more rigorous controls were included, did not change as a function of the inclusion criteria. The lower and upper estimates would be considered small/moderate to moderate/large, and both exceed the threshold for educationally meaningful effects (.25) established by the WWC. Interestingly, criteria not used by the WWC, such as study dosage, maintenance, and fidelity of implementation, were related to effect size estimates; larger effect sizes were associated with better research designs and statistical models that included these factors as moderators.
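For readers less familiar with effect sizes, the following Python sketch shows one common way a standardized mean difference is computed and compared with the .25 benchmark. It assumes Cohen's d with a pooled standard deviation; the summary does not say which index Stockard and Wood used, and the group means, standard deviations, and sample sizes below are hypothetical.

```python
import math

# Benchmark for an educationally meaningful effect, as cited in the text.
WWC_MEANINGFUL_THRESHOLD = 0.25


def cohens_d(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    """Standardized mean difference between treatment and comparison groups,
    using the pooled standard deviation (an assumed choice of index)."""
    pooled_var = ((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / (n_t + n_c - 2)
    return (mean_t - mean_c) / math.sqrt(pooled_var)


# Hypothetical reading scores for students exposed and not exposed to a program.
d = cohens_d(mean_t=105.0, sd_t=15.0, n_t=60, mean_c=99.0, sd_c=15.0, n_c=60)
print(f"d = {d:.2f}, meaningful: {d > WWC_MEANINGFUL_THRESHOLD}")

# The range reported in the summary, checked against the same benchmark.
reported_range = (0.39, 0.79)
print(all(es > WWC_MEANINGFUL_THRESHOLD for es in reported_range))
```

A six-point difference on a scale with a standard deviation of 15 corresponds to d = .40, which is close to the lower bound of the range reported for Reading Mastery.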

In a final analysis, Stockard and Wood compared effect sizes from an inclusive synthesis of the 131 studies with effect sizes from 6 of 34 studies partially meeting the WWC criteria. Estimates from the WWC would be mixed, with negative and positive findings based on small sample sizes and a limited number of studies. This contrasts sharply with the effect size estimates of .39 to .79 that would emerge from the inclusive approach, which would be interpreted as moderate and positive evidence of educational effectiveness.


The questions and issues raised by Stockard and Wood are not unique to educational research. They parallel discussions the author participated in as a member of a National Institutes of Health consensus panel on phenylketonuria (PKU) addressing a variety of best practices for healthcare. PKU is a rare inherited metabolic disorder that was at one point a major cause of severe intellectual disability. It is now detected as part of newborn screening, and individuals with the disorder must adhere to a special diet to prevent intellectual disability. However, questions remained about how long individuals with PKU had to remain on the diet and whether going off the diet after puberty was associated with reductions in cognitive skills. Many studies addressed this question, but they tended to be small, varied in experimental control, and yielded mixed results, a situation ideal for a meta-analysis. To address this issue, the panel commissioned an inclusive meta-analysis of this large corpus of data, which found a statistically significant and moderate association between life-long dietary maintenance and higher cognitive functions. As the panel presented its report, a separate medical collaborative that completes research syntheses using a threshold approach announced that it had reviewed the literature and identified only one study that met its inclusion criteria, which required a rigorous randomized trial. This study, which included a small number of participants, found no negative effects associated with discontinuation of the diet. The panel considered both reports, but ultimately recommended that the diet be maintained based on evidence from the inclusive meta-analysis and other contextual considerations. The panel also identified several problems with the single clinical trial that led it to give little weight to that report.

Wood is the director of the National Institute for Direct Instruction, which is responsible for Reading Mastery. His participation in the Stockard and Wood report could raise questions about potential conflict of interest, which was acknowledged by the authors. However, there are other examples of the differences between inclusive and threshold approaches to empirical syntheses. Consider Reading Recovery, a popular first-grade tutoring program. The WWC evaluation of this program identified 79 studies as candidates for the review. However, after applying thresholds, only a small subset of these available studies was the basis for the WWC report. From this subset, the WWC assigned ratings indicating potentially positive to positive effects on multiple reading outcomes. These ratings were based on one to three studies per domain of reading (e.g., decoding, fluency, comprehension, and general reading performance) and samples of 74 to 227 students. The effect size estimate for general reading achievement was .75, which is large. In contrast, other meta-analyses of Reading Recovery show effect sizes in the .32 to .48 range (see review by Fletcher, Lyon, Fuchs, & Barnes, 2007), and the estimates are much lower if students who are dropped from the program are included and if standardized reading measures are used rather than the measure developed by the program's authors (i.e., the Observation Survey). Interestingly, the WWC criteria do not consider the reliability of the outcome measure. Thus, the Observation Survey is included in the WWC report as an index of general reading achievement despite well-documented concerns about its reliability. Denton et al. (2006) found that the reliabilities of some subtests of the Observation Survey were too low for them to serve as outcome measures and recommended against their use in program evaluation studies.

The Department of Education Race to the Top program selected Reading Recovery for a large-scale implementation study. As part of this implementation, an evaluation designed to meet the WWC criteria was required. The WWC concluded that this intervention study of 6,888 students met criteria with no reservations. For the published study of the Year 1 results based on 866 students, or 69% of the original matched pairs randomly assigned to treatment and comparison groups (May et al., 2015), the effect size on overall reading achievement using the Iowa total reading score was .69, which is in the upper moderate range. The full evaluation results for 6,888 students (70% of the matched pairs) through four years of implementation are available online (http://www.ies.ed.gov/ncee/wwc/study/32027) and are largely consistent with the published study. The effect size was .48 on the Iowa total reading score and .99 on the Observation Survey.

In a response to the first-year evaluation, Chapman and Tunmer (2016) identified several problems:

1. Many students were discontinued from the program, and these students were more severely impaired in reading. This is a commonly reported weakness of Reading Recovery and its own evaluation studies, but it was apparently not applied as a WWC exclusion criterion. This trend was less apparent in the four-year evaluation.

2. Consistent with this finding, the completion rate of 54% in Year 1 was modest, with about 22% of participants referred on for further intervention.

3. Data on reading instruction for the comparison group were reported for only 25% of the participants, so little is known about the counterfactual (control) condition of this study. This improved to about 50% in subsequent years.

In addition, Chapman and Tunmer noted that standard procedures for Reading Recovery were not followed with fidelity. For example, whereas Reading Recovery implementations select the lowest achieving first-graders for intervention, the evaluation study created a pool from which students could be selected, raising the possibility that students were included because they were considered better suited for the program. Although procedures improved in subsequent years, it is not clear that selection was completely random even if analysis focused on evaluation of matched pairs to prevent differential attrition. The capacity for generalizing the results to all participating schools, essential for an impact study, may be more tenuous. Chapman and Tunmer pointed out that evaluations of the long-term effects for Reading Recovery often show low sustainability, which was apparent in a limited evaluation of the initial cohort in the four-year study. Differences between those exposed and not exposed to Reading Recovery were not statistically significant in a relatively small sample powered to detect differences of about .33, with most estimates well below this detectable threshold. Although one should be cautious about accepting the null hypothesis, the confidence intervals included 0 on a variety of measures. The WWC review of the i3 evaluation and its rating of Reading Recovery are based on the immediate impact of the intervention.
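To give a sense of what "powered to detect differences of about .33" implies, the Python sketch below uses the standard normal-approximation formula for a two-group comparison. It is an illustration only: the two-sided alpha of .05, the 80% power, and the independent-groups design are assumptions made here and are not details reported for the follow-up study, which analyzed matched pairs. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate students needed per group to detect a standardized mean
    difference with a two-sided test (normal approximation, equal group sizes)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Smaller true effects require much larger samples to be detected reliably.
for es in (0.33, 0.25, 0.10):
    print(f"effect size {es:.2f}: about {n_per_group(es)} students per group")
```

The required sample grows with the inverse square of the effect size, so a sample sized for differences of about .33 has little chance of detecting effects that are well below that value. This is one reason a nonsignificant result in a small follow-up sample should be read with caution.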

Both Stockard and Wood and Chapman and Tunmer raise serious questions about the interpretation of WWC ratings of literacy programs. These issues reflect both the use of a threshold approach that eliminates studies not meeting detailed and extensive criteria and the failure to include criteria that are important quality indicators for intervention research, such as fidelity of implementation, maintenance, and the reliability of the outcome measure. It is hard to see how strong generalizations can be made based on such a small subset of studies that include only a small number of participants. In the case of Reading Mastery, the WWC could not provide an evaluation of effectiveness following its stated procedures. For Reading Recovery, the estimate of effectiveness may be inflated because of the selective nature of the review and overlooked factors that affect effect size estimates. Yet more inclusive approaches to research synthesis demonstrate that both programs are effective; the educator’s task is to decide how effective, how expensive, and in what contexts an intervention would be most effective. Greater heterogeneity in research synthesis permits these more granular questions to be answered.

To accomplish this task, we recommend that a WWC evaluation be considered one piece of evidence. It is not definitive, however, and other, more inclusive empirical syntheses may offer a more balanced and complete perspective on the effectiveness of a program or practice. In general, we prefer empirical syntheses that are more inclusive and treat variations in study design as moderators of effectiveness, for which Stockard and Wood provide an example. Many educators may find the WWC practice guides more useful because of their attention to a broader range of evidence and issues related to successful implementation. For educators, the key question is less about a categorical determination of effective versus ineffective programs, and more about which intervention principles are associated with effective outcomes and for which students a particular program may be effective. These questions require stepping beyond empirical evaluations of programs and considering contextual, sociodemographic, and school factors related to implementation and how they might affect the effectiveness of a program. The bottom line is that no single source should be considered a basis for determining the effectiveness of an educational program.


Chapman, J. W., & Tunmer, W. E. (2016). Is Reading Recovery an effective intervention for students with reading difficulties? A critique of the i3 Scale-Up study. Reading Psychology, 37(7), 1025–1042. doi:10.1080/02702711.2016.1157538

Denton, C. A., Ciancio, D. J., & Fletcher, J. M. (2006). Validity, reliability, and utility of the Observation Survey of Early Literacy Achievement. Reading Research Quarterly, 41, 8–34.

Fletcher, J. M., Lyon, G. R., Fuchs, L. S., & Barnes, M. A. (2007). Learning disabilities: From identification to intervention (2nd ed.). New York, NY: Guilford Press.

May, H., Gray, A., Sirinides, P., Goldsworthy, H., Armijo, M., Sam, C., & Tognatta, N. (2015). Year one results from the multisite randomized evaluation of the i3 Scale-Up of Reading Recovery. American Educational Research Journal, 52, 547–581.

Shavelson, R. J., & Towne, L. (2002). Scientific research in education. Washington, DC: National Academy Press.