(quoted with permission from Read America's ParenTeacher Magazine, April 1998)
by Diane McGuinness, Ph.D.
Perhaps no other reading study has generated so much heat and so little light
for so long. This study was funded in 1993, ended its first year in 1995,
and data analysis was completed in 1996. A preliminary report was published
in 1997 and the full report in March of 1998. Does it confirm the media hype, or
what Foorman and others have claimed at conferences and in testimony to various
state legislatures? Equally important, does the 1998 report tally with the 1997
report?
Until 1993, when the NICHD began funding large scale studies, reading research
was firmly grounded in the small group model. This model was adopted for the
scientific study of human behaviour from the work of William Gossett (t-tests)
and Sir Ronald Fisher (analysis of variance). It has remained the model for
such studies for nearly 100 years. Fisher stressed that each research design dictates
which test can be used, and that random assignment of subjects to conditions
or 'treatment' is essential. Bypassing these requirements means you cannot use
the statistics invented for a particular design. In education research, the
small group model proceeds in discrete steps. First there is a pilot study to
see whether teachers can learn the method and find it easy to teach, whether they will
use it, and whether the materials and sequence match the child's developmental
level - in short, whether the program is user friendly. Once these issues are established,
small group studies are designed to answer specific questions.
Does intervention in one or two classrooms make a difference when
compared to control groups who get some or none of the components?
Does intervention make a difference when the study (now refined)
is carried out by the teacher?
Does intervention make a difference when compared to other methods
in highly controlled situations, such as a small number of classrooms, random
assignment of children to treatment groups, and the commitment of teachers
to the method?
Does this intervention hold up over time?
An excellent example of this kind of careful, systematic research is the work
of Bonita Blachman and her colleagues. Most of the research that put phonological
processing on the map was conducted in this way. The challenge of this approach
is that the same or similar study needs to be replicated many times, preferably
by other scientists.
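To make the logic of the small group model concrete, here is a minimal sketch in Python (the children, scores, and group sizes are invented for illustration; nothing here comes from the studies under discussion) of what the model requires: random assignment of children to conditions, followed by the test that design licenses - in the two-group case, Gossett's independent-samples t-test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Forty hypothetical children in a pilot classroom study; the scores below
# are simulated purely to illustrate the procedure.
children = np.arange(40)
rng.shuffle(children)                          # random assignment, as Fisher required
treatment, control = children[:20], children[20:]

scores = rng.normal(loc=20, scale=6, size=40)  # pretend post-test word-reading scores
scores[treatment] += 4                         # build in a treatment effect for the demo

# With two randomly assigned groups, the independent-samples t-test applies.
t, p = stats.ttest_ind(scores[treatment], scores[control])
print(f"t = {t:.2f}, p = {p:.4f}")
```

With four methods instead of two, the same logic leads to Fisher's one-way analysis of variance (scipy.stats.f_oneway); the point is that the test is earned by the design, not chosen after the fact.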
NICHD's current model involves millions of dollars of funding for several large-scale
studies. The NICHD mandate is to look at every variable that might impact on
reading, and to use large numbers of students in each study to do this. These
large numbers are supposed to cancel out 'noise' in the data due to loss of
control by the experimenter. For example, testing and data entry are turned over
to undergraduate or graduate students who have not reached the level of professionalism
and expertise of the scientists in charge, who in a smaller study would be doing
the testing and data entry themselves. The Foorman study had problems before
testing even began. There was a gross mismatch between the number of students
and the number of variables. An explicit experimental design was missing. There
was also a persistent use of the wrong statistical test. Let's have a closer
look.
The 1998 report provides a synopsis of three studies ongoing in Houston. Only
the third will be discussed here. This has come to be known as "The Houston
Study" or the "Open Court Study". The 1998 version contains the statistical
analysis. The purpose of the study was to compare Open Court's phonics plus
phonemic awareness program (OC) to an "Embedded Phonics" program (EP), a Whole
Language program with teachers receiving special training (WL-T), and Whole
Language with no special teacher training (WL-S). So the experiment had four
treatment groups.
The study was carried out in first and second grade classrooms in eight Title
I schools, in 70 classrooms (1997 report), five of which were eliminated in
the 1998 report. Thus the experiment had two subject groups: first and second
grade. The sample consisted of 375 children (1997). This number was reduced
by 90 (1998), because these students had no tutorials. There were 174 first
graders (reduced by 17%) and 103 second graders (reduced by 38%). The majority
of the children dropped came from Open Court classrooms, with 18 OC classrooms
(1997) and only 13 in 1998. By contrast, classroom numbers stayed the same for
the remaining groups (EP), (WL-T), and (WL-S).
There is no description in either paper of how teachers were assigned to methods.
Each group (except WL-S, the no-training Whole Language group) received
30 hours of teacher training. Open Court teachers were trained by representatives
of McGraw-Hill publishers (owners of Open Court). Open Court contributed all
classroom materials free of charge. The other training (EP and WL-T) was carried
out by project personnel.
...Only children who qualified for tutoring were in the study. This was
estimated at three to eight children per classroom. There was no random
assignment to classrooms, a requirement of the statistics employed. Ten of the
thirteen WL-S classrooms were in one school described as "tough". This was the
school with the highest poverty level (71% free lunch). The WL-T and EP classrooms
were evenly split between high (64% free lunch) and lower poverty level (32-50%
free lunch) schools. The Open Court classrooms were over twice as likely to
be in the districts' lower poverty schools. This lack of balance for poverty
was repeated for ethnicity. The experimental design had two levels of poverty
and three ethnic groups.
Tutorials were employed in addition to classroom exposure. Tutorials were taught
either one-on-one or in groups of three to five for 30 minutes each day, totaling
approximately 80 hours. The experimental design adds two tutorial treatments:
one-on-one or small group. Twenty-eight Title I tutors were trained by the Open Court trainers. Tutors were expected
to change hats and deliver the various methods (OC, EP, or WL-T) at will, depending
on which child they were working with. To add to the confusion, they had been
using Reading Recovery prior to the study, and were no doubt better versed in that
method. To complicate matters further, tutoring methods either did or did not
match the classroom program as part of the experimental design. This adds another
tutorial treatment: match versus mismatch. Ultimately it was not possible to
compare one-on-one to small group tutoring because the tutors kept reconstituting
the groups.
The authors never set out the experimental design, so the reader must infer
it. This is what we have so far: four methods, two grades, three races, two
poverty levels, two types of tutorials (single or group), two more types of
tutorials (match or mismatch). This is a six factor design, or a 4x2x3x2x2x2
factorial ANOVA design. The number of conditions is determined by multiplying
the number of levels for each factor: 4x2x3x2x2x2=192. To carry out such a study
appropriately and meet its statistical requirements, we would need
twenty subjects for each condition, a total of 3,840 subjects. The 285 subjects
in this study fall a little short of that number. Later in the paper the authors
note that in year two of the study, some children switch methods and some do
not. This would add another factor to the design requiring 7,680 students.
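The arithmetic behind these figures is easy to verify. The sketch below simply multiplies the levels of the six factors inferred above and applies the twenty-subjects-per-condition rule of thumb; the factor names and counts come from the description in the text, not from the report itself.

```python
from math import prod

# Levels of each factor, as inferred from the description of the design.
factors = {
    "reading method": 4,   # OC, EP, WL-T, WL-S
    "grade":          2,   # first, second
    "ethnic group":   3,
    "poverty level":  2,
    "tutorial type":  2,   # one-on-one vs. small group
    "tutorial match": 2,   # matches classroom method vs. mismatch
}

conditions = prod(factors.values())     # 4 x 2 x 3 x 2 x 2 x 2 = 192
subjects_needed = conditions * 20       # 20 subjects per condition = 3,840
print(conditions, subjects_needed)

# The year-two switch/no-switch factor doubles the design again: 7,680 subjects.
print(conditions * 2 * 20)
```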
There is a partial solution to this problem. You can 'control out' subject
variables by random assignment to treatments. In this study, that would mean
that all treatment conditions had equal proportions of three racial groups and
equivalent poverty levels. But the authors did not do this. Instead, they "adjusted
for ethnicity" and ignored poverty levels entirely. They eliminate grades by
compiling the first and second grade data. In presenting the results, they eliminate
the tutoring variable by voiding the tutorials altogether, stating that, "one-on-one
versus small group tutorials had no significant effect on rate of reading growth
or outcome measure." Next they state that match versus mismatch of classroom
and tutorial method had no effect either. Effect on what is not revealed. Nor
is it revealed how they might know this, since the statistics they employed
were not presented.
Having neatly disposed of five factors in a six factor experimental design,
the authors present results for what remains: the comparison of reading methods. But
now something else happens. The authors introduce a new condition: repeated
testing over time, adding a "repeated measures" condition with four test periods.
We now have a 4x4 mixed ANOVA design.
Children were tested initially and three more times over the year with two instruments:
the Torgesen/Wagner phonological test battery, and a word reading test. The
authors explained that the word reading test was designed in-house because "the
standardized test have too few easy words on them". Neither test is a standardized
test. It should be noted here that standardized tests are employed for research
to be "reliable and replicable". A need for "reliable and replicable" test is
revealed by the fact that in almost every case the reading test standard deviations
were equal to or considerably higher than the mean. This means the variance
was so high that their reading test has no validity.
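The point about the standard deviations can be stated as a simple check: whenever a test's standard deviation is as large as its mean (a coefficient of variation of 1 or more), the scores are so dispersed that group means carry very little information. A minimal sketch, using invented numbers rather than the study's own figures, shows the check:

```python
# Hypothetical (mean, standard deviation) pairs for a word-reading measure.
# These values are illustrative only; they are not taken from the study.
groups = {
    "Group A": (10.0, 12.0),
    "Group B": (5.0, 6.5),
    "Group C": (2.0, 1.5),
}

for name, (mean, sd) in groups.items():
    cv = sd / mean                      # coefficient of variation
    verdict = "SD >= mean: extremely noisy" if sd >= mean else "acceptable spread"
    print(f"{name}: mean={mean}, SD={sd}, CV={cv:.2f} -> {verdict}")
```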
Pages of statistical analysis were devoted to various growth curves and "predicted
growth curves", based on these invalid tests, and so, are essentially meaningless.
In comparing reading programs, Open Court children had significantly higher
phonological scores in April. However, this result occurred after the blending
sub-tests were eliminated, and the scores from the analysis sub-tests were reduced
to factor scores based on "district norms". April scores on their in-house reading
test showed that the Open Court group read 12.7 words, versus 5 words for the
Embedded Phonics and WL-T groups, and 1.9 words for the WL-S group. This modest
effect was not significant after post hoc tests.
A result that misses significance on an invalid test is scarcely anything to
shout about. Nor is this even relevant in view of the authors' recent statement
that "the word reading list was constructed to have lots of words at relatively
similar levels of difficulty, so a mean difference of a few words is not meaningful."
[Foorman et al., letter to Denny Taylor, CATENet, April 1998]
Results not mentioned in the text were presented in a table of reading mean
scores. These reveal that the second graders' April scores did not differ much
for the different reading programs. Average words read were: Open Court=19.4,
Embedded Phonics=18.3, and the two whole language groups were WL-T=16.2 and
WL-S=14.3. Given the high standard deviations for these values, it is unlikely
these scores would differ significantly.
The children took standardized tests in May. There were no pretests given on
these tests, making these data irrelevant. The standardized scores are simply
hanging in the wind, with no way of knowing how much students gained (or lost) over
the school year.
In 1997, Foorman et al reported that the Open Court groups had significantly
higher percentile scores on all the Woodcock-Johnson subtests and Kaufman spelling.
In 1998, using standardized scores and post hoc tests, only the combined word
identification/word attack score was significantly different between groups.
No other group comparisons were significant after post hoc tests were carried
out. The combined word identification/word
attack scores were Open Court=96.1, Embedded Phonics=88.6, WL-T=89.6, WL-S=84.5.
Depending on which conversion table you use, a standard score of 96 falls between
the 40th and 43rd percentile ranks, which means that roughly 60% of the first and
second grade population scores higher than these children.
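For readers who want to check the percentile claim, the sketch below applies a simple normal-curve conversion (one possible "conversion table") to the combined scores reported above, assuming the usual standard-score metric of mean 100 and standard deviation 15.

```python
from scipy.stats import norm

MEAN, SD = 100, 15   # assumed standard-score metric (mean 100, SD 15)

def percentile_rank(standard_score: float) -> float:
    """Normal-curve percentile rank for a given standard score."""
    return 100 * norm.cdf((standard_score - MEAN) / SD)

# Combined word identification/word attack standard scores reported in the text.
for label, score in [("OC", 96.1), ("EP", 88.6), ("WL-T", 89.6), ("WL-S", 84.5)]:
    print(f"{label}: standard score {score} -> percentile rank ~{percentile_rank(score):.0f}")
```

On this conversion, 96 lands at roughly the 40th percentile, the low end of the 40-43 range cited above, which is where the "60% score higher" figure comes from.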
As the authors had already dismissed tutoring as irrelevant, they
conclude that any effect of teaching method is entirely due to the classroom
program! If, indeed, the size of tutoring group, and matched versus mismatched
tutorials, had no effect (no statistics were provided for this assertion), this
does not mean that tutoring, per se, had no effect. To prove this, you would
need an untutored control group (another missing variable in this study). Without
an untutored control group, there is no way to disentangle tutoring effect from
classroom effects. This is a puzzling omission, because the 90 untutored children
dropped from the study could have been used as a control group.
In conclusion, there are so many methodological problems with this study that
a variety of explanations for the one significant result could be proposed:
1.) Tutors in the Open Court schools were better teachers. This would mean that
tutoring would have a greater effect in OC schools.
2.) Children from different ethnic groups responded differently to reading programs.
No information is provided in either paper on the breakdown or proportions of races
in the different classrooms, so it is impossible to know, let alone replicate.
3.) Teachers taught by representatives of the company that publishes the program
are better trained than teachers who are not. In March 1998, Foorman et al stated,
in a letter to Denny Taylor: "our interest in including the Whole Language teacher group
with no special training (WL-S) is to demonstrate that training of teachers
makes a difference in pupil outcomes." By this same logic, teachers who had
even better training, from professional trainers with a vested interest (Open
Court representatives), might make a difference in pupil outcomes. Why weren't
Whole Language experts brought in to train the Whole Language group?
4.) Children who have higher initial reading scores will have higher post-test
scores. There were no intake scores in this study, so there is no control available
for this.
5.) Children who do not suffer from extreme poverty are more able to focus on
their studies, and are not distressed, unwell, or hungry during lessons or testing.
6.) The results were mainly due to first grade data, with no advantage for Open
Court in second grade, when children should be expected to be doing very well
in their reading. As standardized test results were not presented separately
for first and second graders, the study is not reliable.
Foorman continues in the letter to Taylor: "Random assignment is an important
safeguard to internal validity," but then says, "We were unable to assign teachers
randomly to curricula, and we were unable to randomly assign children to teachers.
This situation is not atypical in large scale investigations in education... [this
is] why we believe replicability and replication are so important in education
research." Well, if large scale investigations preclude proper research
design, and random allocation of subjects to groups, why do them in the first
place? Why assume that replication of a bad design is going to solve anything?
A final question is what happened in the second year of the study when children
did or did not switch from one method to another (per the research design),
and a second cohort started their first year.
In the discussion section the authors confidently assert that direct instruction
in Open Court is responsible for the "significant effects" on the reading tests.
They note that the standardized passage comprehension test was not significant
by post hoc tests, but argue for ten more lines that the Open Court group really
was superior to all other groups. The bias of the authors is evident throughout
this section, with no attempt to reflect on the problems of their design. This
is not in the spirit of objective scientific inquiry.
There was an incredible statement about tutoring programs that are "springing
up around the United States" which mismatch the classroom program and, because
of this, "are ineffective." One reference is provided for this statement. This
comes out of the blue, apparently as an oblique attack on private reading specialists,
who, at their peril, ignore what the child is being taught in class. Yet in
their own study, the authors claim that the mismatched tutorials had "no effect".
The authors cannot have it both ways. If mismatched tutorials were thought to
be ineffective in the first place, this study is invalid and borders on being
unethical.
Finally, the authors advise the reader that, "Future studies should evaluate
entire classrooms, not just Title I children," as if no one had ever studied
whole classrooms before. As the reader is well aware, people like Slavin, Howard,
Blachman, Engelmann, Wallach, and scores of others, have been doing classroom
research for years.
In the final analysis, this study may have set scientific research on reading
back a decade. It has utterly failed to show whether any of these programs is
better than the others. For the authors to draw firm conclusions from such a study
is irresponsible. Their lack of objectivity and evident bias will cause people
in the education community to be even more suspicious of "science" than they
were already. This is a disservice to everyone in the field, as well as a disservice
to the children, the teachers, and the school administrators, all of whom participated
in this study in good faith.