(quoted with permission from Read America's ParenTeacher Magazine, April 1998)
by Diane McGuinness, Ph.D.
Perhaps no other reading study has generated so much heat and so little light
for so long. This study was funded in 1993, ended its first year in 1995,
and data analysis was completed in 1996. A preliminary report was published
in 1997 and the full report in March of 1998. Does it confirm the media hype, or
what Foorman and others have claimed at conferences and in testimony to various
state legislatures? Equally important, does the 1998 report tally with the 1997
report?
Until 1993, when the NICHD began funding large scale studies, reading research
was firmly grounded in the small group model. This model was adopted for the
scientific study of human behaviour from the work of William Gossett (t-tests)
and Sir Ronald Fisher (analysis of variance). It has remained the model for
such studies for nearly 100 years. Fisher stressed that each research design dictates
which test can be used, and that random assignment of subjects to conditions
or 'treatment' is essential. Bypassing these requirements means you cannot use
the statistics invented for a particular design. In education research, the
small group model proceeds in discrete steps. First there is a pilot study to
see whether teachers can learn the method and find it easy to teach, whether they will
use it, and whether the materials and sequence match the child's developmental
level - in short, whether the program is user friendly. Once these issues are established,
small group studies are designed to answer specific questions.
Does intervention in one or two classrooms make a difference when
compared to control groups who get some or none of the components?
Does intervention make a difference when the study (now refined)
is carried out by the teacher?
Does intervention make a difference when compared to other methods
in highly controlled situations, such as a small number of classrooms, random
assignment of children to treatment groups, and the commitment of teachers
to the method?
Does this intervention hold up over time?
An excellent example of this kind of careful, systematic research is the work
of Bonita Blachman and her colleagues. Most of the research that put phonological
processing on the map was conducted in this way. The challenge of this approach
is that the same or similar study needs to be replicated many times, preferably
by other scientists.
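To make the logic of the small group model concrete, here is a minimal sketch in Python (the children, scores, and group sizes are invented for illustration; nothing here comes from the studies under discussion) of what the model requires: random assignment of children to conditions, followed by the test that design licenses - in the two-group case, Gossett's independent-samples t-test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Forty hypothetical children in a pilot classroom study; the scores below
# are simulated purely to illustrate the procedure.
children = np.arange(40)
rng.shuffle(children)                          # random assignment, as Fisher required
treatment, control = children[:20], children[20:]

scores = rng.normal(loc=20, scale=6, size=40)  # pretend post-test word-reading scores
scores[treatment] += 4                         # build in a treatment effect for the demo

# With two randomly assigned groups, the independent-samples t-test applies.
t, p = stats.ttest_ind(scores[treatment], scores[control])
print(f"t = {t:.2f}, p = {p:.4f}")
```

With four methods instead of two, the same logic leads to Fisher's one-way analysis of variance (scipy.stats.f_oneway); the point is that the test is earned by the design, not chosen after the fact.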
NICHD's current model involves millions of dollars of funding for several large-scale
studies. The NICHD mandate is to look at every variable that might impact on
reading, and to use large numbers of students in each study to do this. These
large numbers are supposed to cancel out 'noise' in the data due to loss of
control by the experimenter. For example, testing and data entry are turned over
to undergraduate or graduate students who have not reached the level of professionalism
and expertise of the scientists in charge, who in a smaller study would be doing
the testing and data entry themselves. The Foorman study had problems before
testing even began. There was a gross mismatch between the number of students
and the number of variables. An explicit experimental design was missing. There
was also a persistent use of the wrong statistical test. Let's have a closer
look.
The 1998 report provides a synopsis of three studies ongoing in Houston. Only
the third will be discussed here. This has come to be known as "The Houston
Study" or the "Open Court Study". The 1998 version contains the statistical
analysis. The purpose of the study was to compare Open Court's phonics plus
phonemic awareness program (OC) to an "Embedded Phonics" program (EP), a Whole
Language program with teachers receiving special training (WL-T), and Whole
Language with no special teacher training (WL-S). So the experiment had four
treatment groups.
The study was carried out in first and second grade classrooms in eight Title
I schools, in 70 classrooms (1997 report), five of which were eliminated in
the 1998 report. Thus the experiment had two subject groups: first and second
grade. The sample consisted of 375 children (1997). This number was reduced
by 90 (1998), because these students had no tutorials. There were 174 first
graders (reduced by 17%) and 103 second graders (reduced by 38%). The majority
of the children dropped came from Open Court classrooms, with 18 OC classrooms
(1997) and only 13 in 1998. By contrast, classroom numbers stayed the same for
the remaining groups (EP), (WL-T), and (WL-S).
There is no description in either paper of how teachers were assigned to methods.
Each group (except WL-S, the no-training Whole Language group) received
30 hours of teacher training. Open Court teachers were trained by representatives
of McGraw-Hill publishers (owners of Open Court). Open Court contributed all
classroom materials free of charge. The other training (EP and WL-T) was carried
out by project personnel.
...Only children who qualified for tutoring were in the study. This was
estimated at three to eight children per classroom. There was no random
assignment to classrooms, a requirement of the statistics employed. Ten of the
thirteen WL-S classrooms were in one school described as "tough". This was the
school with the highest poverty level (71% free lunch). The WL-T and EP classrooms
were evenly split between high (64% free lunch) and lower poverty level (32-50%
free lunch) schools. The Open Court classrooms were over twice as likely to
be in the districts' lower poverty schools. This lack of balance for poverty
was repeated for ethnicity. The experimental design had two levels of poverty
and three ethnic groups.
Tutorials were employed in addition to classroom exposure. Tutorials were taught
either one-on-one or in groups of three to five for 30 minutes each day, totaling
approximately 80 hours. The experimental design adds two tutorial treatments:
one-on-one or small group. Twenty-eight Title I tutors were trained by the Open Court trainers. Tutors were expected
to change hats and deliver the various methods (OC, EP, or WL-T) at will, depending
on which child they were working with. To add to the confusion, they had been
using Reading Recovery prior to the study, and were no doubt better versed in that
method. To complicate matters further, tutoring methods either did or did not
match the classroom program as part of the experimental design. This adds another
tutorial treatment: match versus mismatch. Ultimately it was not possible to
compare one-on-one to small group tutoring because the tutors kept reconstituting
the groups.
The authors never set out the experimental design, so the reader must infer
it. This is what we have so far: four methods, two grades, three races, two
poverty levels, two types of tutorials (single or group), two more types of
tutorials (match or mismatch). This is a six factor design, or a 4x2x3x2x2x2
factorial ANOVA design. The number of conditions is determined by multiplying
the number of levels for each factor: 4x2x3x2x2x2=192. To carry out such a study
appropriately and meet its statistical requirements, we would need
twenty subjects for each condition, a total of 3,840 subjects. The 285 subjects
in this study fall a little short of that number. Later in the paper the authors
note that in year two of the study, some children switch methods and some do
not. This would add another factor to the design requiring 7,680 students.
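The arithmetic behind these figures is easy to verify. The sketch below simply multiplies the levels of the six factors inferred above and applies the twenty-subjects-per-condition rule of thumb; the factor names and counts come from the description in the text, not from the report itself.

```python
from math import prod

# Levels of each factor, as inferred from the description of the design.
factors = {
    "reading method": 4,   # OC, EP, WL-T, WL-S
    "grade":          2,   # first, second
    "ethnic group":   3,
    "poverty level":  2,
    "tutorial type":  2,   # one-on-one vs. small group
    "tutorial match": 2,   # matches classroom method vs. mismatch
}

conditions = prod(factors.values())     # 4 x 2 x 3 x 2 x 2 x 2 = 192
subjects_needed = conditions * 20       # 20 subjects per condition = 3,840
print(conditions, subjects_needed)

# The year-two switch/no-switch factor doubles the design again: 7,680 subjects.
print(conditions * 2 * 20)
```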
There is a partial solution to this problem. You can 'control out' subject
variables by random assignment to treatments. In this study, that would mean
that all treatment conditions had equal proportions of three racial groups and
equivalent poverty levels. But the authors did not do this. Instead, they "adjusted
for ethnicity" and ignored poverty levels entirely. They eliminate grades by
compiling the first and second grade data. In presenting the results, they eliminate
the tutoring variable by voiding the tutorials altogether, stating that, "one-on-one
versus small group tutorials had no significant effect on rate of reading growth
or outcome measure." Next they state that match versus mismatch of classroom
and tutorial method had no effect either. Effect on what is not revealed. Nor
is it revealed how they might know this, since the statistics they employed
were not presented.
Having neatly disposed of five factors in a six factor experimental design,
the authors present results for what remains: the comparison of reading methods. But
now something else happens. The authors introduce a new condition: repeated
testing over time, adding a "repeated measures" condition with four test periods.
We now have a 4x4 mixed ANOVA design.
Children were tested initially and three more times over the year with two instruments:
the Torgesen/Wagner phonological test battery, and a word reading test. The
authors explained that the word reading test was designed in-house because "the
standardized test have too few easy words on them". Neither test is a standardized
test. It should be noted here that standardized tests are employed for research
to be "reliable and replicable". A need for "reliable and replicable" test is
revealed by the fact that in almost every case the reading test standard deviations
were equal to or considerably higher than the mean. This means the variance
was so high that their reading test has no validity.
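The point about the standard deviations can be stated as a simple check: whenever a test's standard deviation is as large as its mean (a coefficient of variation of 1 or more), the scores are so dispersed that group means carry very little information. A minimal sketch, using invented numbers rather than the study's own figures, shows the check:

```python
# Hypothetical (mean, standard deviation) pairs for a word-reading measure.
# These values are illustrative only; they are not taken from the study.
groups = {
    "Group A": (10.0, 12.0),
    "Group B": (5.0, 6.5),
    "Group C": (2.0, 1.5),
}

for name, (mean, sd) in groups.items():
    cv = sd / mean                      # coefficient of variation
    verdict = "SD >= mean: extremely noisy" if sd >= mean else "acceptable spread"
    print(f"{name}: mean={mean}, SD={sd}, CV={cv:.2f} -> {verdict}")
```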
Pages of statistical analysis were devoted to various growth curves and "predicted
growth curves", based on these invalid tests, and so, are essentially meaningless.
In comparing reading programs, Open Court children had significantly higher
phonological scores in April. However, this result occurred after the blending
sub-tests were eliminated, and the scores from the analysis sub-tests were reduced
to factor scores based on "district norms". April scores on their in-house reading
test showed that the Open Court group read 12.7 words, versus 5 words for the
Embedded Phonics and WL-T groups, and 1.9 words for the WL-S group. This modest
effect was not significant after post hoc tests.
A result that misses significance on an invalid test is scarcely anything to
shout about. Nor is this even relevant in view of the authors' recent statement
that "the word reading list was constructed to have lots of words at relatively
similar levels of difficulty, so a mean difference of a few words is not meaningful."
[Foorman et al., letter to Denny Taylor, CATENet, April 1998]
Results not mentioned in the text were presented in a table of reading mean
scores. These reveal that the second graders' April scores did not differ much
for the different reading programs. Average words read were: Open Court=19.4,
Embedded Phonics=18.3, and the two whole language groups were WL-T=16.2 and
WL-S=14.3. Given the high standard deviations for these values, it is unlikely
these scores would differ significantly.
The children took standardized tests in May. There were no pretests given on
these tests, making these data irrelevant. The standardized scores are simply
hanging in the wind, with no way of knowing how much students gained (or lost) over
the school year.
In 1997, Foorman et al reported that the Open Court groups had significantly
higher percentile scores on all the Woodcock-Johnson subtests and Kaufman spelling.
In 1998, using standardized scores and post hoc tests, only the combined word
identification/word attack score was significantly different between groups.
No other group comparisons were significant after post hoc tests were carried
out. The combined word identification/word
attack scores were Open Court=96.1, Embedded Phonics=88.6, WL-T=89.6, WL-S=84.5.
Depending on which conversion table you use, a standard score of 96 falls between
the 40th and 43rd percentile ranks, which means that roughly 60% of the first and
second grade population scores higher than these children.
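For readers who want to check the percentile claim, the sketch below applies a simple normal-curve conversion (one possible "conversion table") to the combined scores reported above, assuming the usual standard-score metric of mean 100 and standard deviation 15.

```python
from scipy.stats import norm

MEAN, SD = 100, 15   # assumed standard-score metric (mean 100, SD 15)

def percentile_rank(standard_score: float) -> float:
    """Normal-curve percentile rank for a given standard score."""
    return 100 * norm.cdf((standard_score - MEAN) / SD)

# Combined word identification/word attack standard scores reported in the text.
for label, score in [("OC", 96.1), ("EP", 88.6), ("WL-T", 89.6), ("WL-S", 84.5)]:
    print(f"{label}: standard score {score} -> percentile rank ~{percentile_rank(score):.0f}")
```

On this conversion, 96 lands at roughly the 40th percentile, the low end of the 40-43 range cited above, which is where the "60% score higher" figure comes from.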
As the authors had already dismissed tutoring as irrelevant, they
conclude that any effect of teaching method is entirely due to the classroom
program! If, indeed, the size of tutoring group, and matched versus mismatched
tutorials, had no effect (no statistics were provided for this assertion), this
does not mean that tutoring, per se, had no effect. To prove this, you would
need an untutored control group (another missing variable in this study). Without
an untutored control group, there is no way to disentangle tutoring effect from
classroom effects. This is a puzzling omission, because the 90 untutored children
dropped from the study could have been used as a control group.
In conclusion, there are so many methodological problems with this study that
a variety of explanations for the one significant result could be proposed:
1.) Tutors in the Open Court schools were better teachers. This would mean that
tutoring would have a greater effect in OC schools.
2.) Children from different ethnic groups responded differently to reading programs.
No information is provided in either paper on the breakdown or proportions of races
in the different classrooms, so it is impossible to know, let alone replicate.
3.) Teachers taught by representatives of the company that publishes the program
are better trained than teachers who are not. In March 1998, Foorman et al stated,
in a letter to Denny Taylor: "our interest in including the Whole Language teacher group
with no special training (WL-S) is to demonstrate that training of teachers
makes a difference in pupil outcomes." By this same logic, teachers who had
even better training, from professional trainers with a vested interest (Open
Court representatives), might make a difference in pupil outcomes. Why weren't
Whole Language experts brought in to train the Whole Language group?
4.) Children who have higher initial reading scores will have higher post-test
scores. There were no intake scores in this study, so there is no control available
for this.
5.) Children who do not suffer from extreme poverty are more able to focus on
their studies, and are not distressed, unwell, or hungry during lessons or testing.
6.) The results were mainly due to first grade data, with no advantage for Open
Court in second grade, when children should be expected to be doing very well
in their reading. As standardized test results were not presented separately
for first and second graders, the study is not reliable.
Foorman continues in the letter to Taylor: "Random assignment is an important
safeguard to internal validity," but then says, "We were unable to assign teachers
randomly to curricula, and we were unable to randomly assign children to teachers.
This situation is not atypical in large scale investigations in education... [this
is] why we believe replicability and replication are so important in education
research." Well, if large scale investigations preclude proper research
design, and random allocation of subjects to groups, why do them in the first
place? Why assume that replication of a bad design is going to solve anything?
A final question is what happened in the second year of the study when children
did or did not switch from one method to another (per the research design),
and a second cohort started their first year.
In the discussion section the authors confidently assert that direct instruction
in Open Court is responsible for the "significant effects" on the reading tests.
They note that the standardized passage comprehension test was not significant
by post hoc tests, but argue for ten more lines that the Open Court group really
was superior to all other groups. The bias of the authors is evident throughout
this section, with no attempt to reflect on the problems of their design. This
is not in the spirit of objective scientific inquiry.
There was an incredible statement about tutoring programs that are "springing
up around the United States" which mismatch the classroom program and, because
of this, "are ineffective." One reference is provided for this statement. This
comes out of the blue, apparently as an oblique attack on private reading specialists,
who, at their peril, ignore what the child is being taught in class. Yet in
their own study, the authors claim that the mismatched tutorials had "no effect".
The authors cannot have it both ways. If mismatched tutorials were thought to
be ineffective in the first place, this study is invalid and borders on being
unethical.
Finally, the authors advise the reader that, "Future studies should evaluate
entire classrooms, not just Title I children," as if no one had ever studied
whole classrooms before. As the reader is well aware, people like Slavin, Howard,
Blachman, Engelmann, Wallach, and scores of others, have been doing classroom
research for years.
In the final analysis, this study may have set scientific research on reading
back a decade. It has utterly failed to show whether any of these programs is
better than the others. For the authors to draw firm conclusions from such a study
is irresponsible. Their lack of objectivity and evident bias will cause people
in the education community to be even more suspicious of "science" than they
were already. This is a disservice to everyone in the field, as well as a disservice
to the children, the teachers, and the school administrators, all of whom participated
in this study in good faith.