June 19, 1998
Diane McGuinness, Ph.D.
University of South Florida -
St. Petersburg
140 7th Avenue South
St. Petersburg, FL 33701
Dear Dr. McGuinness:
We were recently bombarded by faxed copies of a review of our study published
in the Journal of Educational Psychology this past March, which you apparently completed
for Parenteacher magazine. We are writing to express our concern about the tenor
and inaccuracy of your review. We were particularly concerned because many of
the issues that you raised have been addressed in other venues by our group,
some of which you have clearly read. We don't understand the need to distort
and misrepresent our research, much less the need to denigrate research that
your review suggests you don't really understand. We are also concerned because
of the misunderstanding of NICHD research embedded in your review. We will address
some of our concerns on a point-by-point basis.
In the Introduction, you implied that there is a difference between what
we report at conferences and during testimony and the actual published report.
This is not the case. For example, we have never claimed that our study pits
phonics against whole language. We have always noted that the instructional
principle of primary interest is the explicitness of instruction in the alphabetic
principle necessary to facilitate early reading skills in high risk populations.
In the 1997 paper, the results are clearly described as "preliminary findings."
Your depiction of large-scale and small-group studies is puzzling. As we're
sure you are aware, the NICHD has funded research on reading skills since 1965.
The size of the studies varies. However, your notion that the "small group model"
is tied to certain types of statistical methods is incorrect. For example, Fisher
developed the analysis of variance as a shortcut because of the amount of time
required to compute discriminant functions prior to the advent of computers.
He viewed it as a limited method. There is no strong relationship between certain
types of statistical methods and variations in experimental design. Because
of limitations of these types of models, some of which stem from the absence
of high-speed computers, it was often necessary to design studies that would
correspond to the limitations of existing statistical models, such as ANOVA.
The notion that "random assignment of subjects to conditions ... is essential"
is not correctly attributed to Fisher and is not a requirement for any parametric
statistic. For example, ANOVA has only two primary assumptions. The first is
normality, an assumption to which ANOVA is usually robust. The second is independence
of observations, to which ANOVA is not robust. There is no requirement for random
assignment as far as statistics are concerned. Certainly, as Campbell and Stanley
(1963) outlined, random assignment has considerable influence on the strength
of inferences that can be made from different kinds of experimental designs.
However, even when random assignment is not possible, as is common in most
instructional research, inferences can still be made from strong quasi-experimental
studies. Since you read our explanation of this point in our response to Denny
Taylor, it is surprising that you bring up this issue as a problem. Moreover,
it is astonishing that you would suggest that random assignment is required
for parametric statistics. Your notion of how a "small group model proceeds"
is arbitrary and artificial, and is followed in no systematic way by any reading
researcher whom we know.
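To make the statistical point concrete, consider the following minimal sketch
(written in modern software with made-up scores, not data from our study): a
one-way ANOVA computed on intact, non-randomly assigned classrooms. The F
statistic can be computed perfectly well; random assignment bears on the causal
interpretation of the result, not on whether the statistic applies.

    # One-way ANOVA on intact (non-randomly assigned) groups; hypothetical scores.
    # The test assumes approximate normality and independence of observations;
    # it does not require that the groups were formed by random assignment.
    from scipy import stats

    classroom_a = [12, 15, 14, 10, 13, 16, 11, 14]
    classroom_b = [9, 11, 10, 12, 8, 10, 13, 9]
    classroom_c = [14, 17, 15, 13, 16, 18, 12, 15]

    f_stat, p_value = stats.f_oneway(classroom_a, classroom_b, classroom_c)
    print(f"F = {f_stat:.2f}, p = {p_value:.4f}")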
Benita Blachman is our colleague. We have worked closely with her for several
years. You may be interested to know that in her current intervention studies,
we are providing some of the methodological expertise. Her current studies
are designed to use multi-level models of the type commonly recommended for
instructional research, so that individual change models can be used. Individual
change models are important because they avoid many of the problems of traditional
pre-post designs and statistics such as ANOVA, which do not permit analyses of
individual change and are hampered by the widely acknowledged problems in evaluating
difference scores.
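For concreteness, a minimal individual change (growth) model of the general kind
we have in mind can be sketched as follows (the notation is illustrative only and
is not taken from any particular paper):

    y_{ti} = \pi_{0i} + \pi_{1i} t_{ti} + e_{ti}
    \pi_{0i} = \beta_{00} + u_{0i}
    \pi_{1i} = \beta_{10} + u_{1i}

Here y_{ti} is child i's score at occasion t, \pi_{0i} and \pi_{1i} are that child's
own intercept (initial status) and slope (growth rate), and the second-level equations
allow those parameters to vary across children. Because each child's growth is estimated
from repeated measurements, the model avoids relying on a single, notoriously unreliable
post-minus-pre difference score.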
NICHD has no particular mandate "to look at every variable that might impact
on reading, and to use large numbers of children in each study …" Most of the
NICHD research is investigator initiated. It receives an independent review
by a group of peers on a study section. Program is independent of review. Program
cannot fund studies that don't pass muster under peer review. Any grant funded
must be at a certain level of quality based on comparisons across the entire
NIH. This is the same model used to fund all biomedical research in the National
Institutes of Health. Although Program might recommend "large scale studies",
the reality is that the NICHD program directed by Reid Lyon funds research at
all levels, involving many different types of designs and many different sample
sizes. The notion that "scientists" would "be doing the testing and data entry
themselves" is not correct. This actually depends more on whether the investigator
has funding. If they have funding, you can be assured that they will typically
hire research assistants to do data collection and data entry. This is a technical
skill that does not require a "scientist."
Your notions about the mismatch between the number of students and the number
of variables, the absence of an explicit experimental design, and the use of the
wrong statistical tests are simply incorrect. Moreover, your depiction of the purpose
of the study is incorrect, as is your description of the program. Open Court's
(1995) Collections for Young Scholars is not a phonics program. It is a balanced
approach that includes phonemic awareness, phonics, and a significant emphasis
on literature.
We did not conceptualize our study in terms of a traditional analysis of variance
model. In fact, as we explicitly described in our published paper, this study
was conceptualized as a multi-level design in which children were nested within
classrooms within schools. There is a substantial literature in the social sciences
on the importance of multi-level models for estimating sources of variability
attributable to different components of a particular educational practice or
intervention. Multi-level models were developed in part because of the limitations
of models based on analysis of variance designs. Your comments clearly indicate
that you do not understand this area of educational research despite the extensive
literature in this area. Hence, your notion of the experimental design
is simply incorrect. There are indeed four methods in two grades. There are
three levels of ethnicity represented, but this was never conceptualized as
a between-subjects factor. Moreover, all the children were Title I eligible,
so there are not two levels of poverty. You are confusing the instructional
curricula with differences within the schools. Your idea of some sort of strict
correspondence between number of subjects and factors is erroneous. Even in
an analysis of variance design, this would depend on power. All the decisions
that we made concerning the collapsing of grade 1 and grade 2 analyses, the tutorials,
etc. were based on statistical analyses demonstrating no effects of these components.
As we stated in the paper, we chose to evaluate variability due to age (a continuous
variable) and not grade. Because these components had no effect on instructional
outcomes, this is what was reported. Again, you state that "there was no random assignment
to classrooms, a requirement of the statistics employed." This is not a requirement
of statistics. It has more to do with how the results are interpreted and the
strengths of causal inferences that can be made from the design. The bottom
line is that you apparently do not understand multi-level, individual change
models, which is unfortunate.
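As a concrete sketch of what such a specification looks like in modern software
(the variable names are hypothetical, and this is not the code or the exact model
from the published paper), children nested within classrooms within schools might
be modeled with random intercepts at both the school and classroom levels:

    # Multi-level sketch: children nested within classrooms within schools.
    # Hypothetical column names; not the analysis reported in the paper.
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("scores.csv")  # columns: school, classroom, curriculum, word_reading

    model = smf.mixedlm(
        "word_reading ~ curriculum",                   # fixed effect of curriculum
        data=df,
        groups="school",                               # random intercept for each school
        vc_formula={"classroom": "0 + C(classroom)"},  # classrooms nested within schools
    )
    print(model.fit().summary())

Models of this kind partition the variance in outcomes into school, classroom, and
child components instead of forcing everything into a single between-subjects error
term.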
Many of your descriptions concerning the assignment of teachers to methods,
training, tutoring, nature of the schools, etc. are nothing more than guesses
on your part. Since you read the posts, you know that your depiction of our
relationship with Open Court is incorrect. Open Court provided several days
of in-service training for our teachers, consistent with the amount of training
in a particular curriculum that our instructional leaders provided in the other
curricular areas. At the time of the study, McGraw-Hill did not own Open Court.
Open Court was selected by the school district because the district used Open
Court Math. Materials were provided to the district free of charge on a pilot
basis in 1994-95, prior to the release of the 1995 program. We have no idea how
you could conclude "the Open Court classrooms were over twice as likely to be
in the District's lower poverty schools." All the children served were Title
I eligible because of the school-level designation and their literacy levels.
There were no differences in age, ethnicity, free lunch programs, or any sociodemographic
variable across the instructional groups. As the paper states, we controlled
for school effects by placing more than one curriculum in a school. We controlled
for tutor effects by asking tutors to deliver more than one tutorial approach
– either the district standard (Reading Empowerment – not Reading Recovery)
or Open Court or Embedded Phonics. Our question is why it is necessary to distort
the nature of the study as well as the results. As Keith Stanovich noted in
his 1997 Causey Award Address, reading professionals tend to deal with issues
by distorting the database when in fact we should all share in the richness of
the database that is common to all of us.
Your depiction of the measures is not correct. You seem to have confused
"norm-referenced" with "standardized". In fact, both our word reading tests
and the Torgesen-Wagner battery have extensive psychometric research supporting
their reliability and validity. It is absolutely correct that the Woodcock-Johnson
has an insufficient number of words to detect change. This is why the word list
is necessary. The word reading list correlates .88 with the word reading subtest
from the Woodcock-Johnson. The performance of children on the test (i.e., the
relationship of means and standard deviations) has little to do with the reliability
of the measures, but says more about the variability of the children. You can't
conclude that the reading tests have no validity based solely on student performance.
Your conclusion that the tests are "invalid" is frankly ridiculous and is
little more than an attempt to denigrate based on your own lack of understanding.
Many of your descriptions of the results are not correct. As the paper states,
we eliminated components of the phonological awareness tests because they were
highly correlated (r > .9). The results hold whether we use the analysis or synthesis
subtests. The differences on the word list were statistically significant, even
with a conservative alpha adjustment. This was apparent in the growth
curve analyses, reflecting effects on both slope and intercept. The t tests that
were conducted were used to follow up what you describe as "ANOVA" statistics,
which is not a correct depiction of what was done. We performed overall tests
and followed "significant" overall tests with the t tests. The comments about
the April scores reflect only the intercept effect, not the effect of slope.
They ignore completely the log linear analyses that were done. We did not give
pre-tests on the norm-referenced educational tests because we did not think
that such analyses were necessary, given the amount of information that we had
on change over time during the year. Surely you are aware of all the issues
with the computation of statistical significance in statistical designs based
on difference scores, pre-post designs, etc. As we reported, there were no differences
on the initial assessment, which could be treated as a pre-test if that were
the type of experimental design being employed. Your suggestion that
the only significant group comparisons involved the Woodcock-Johnson and Kaufman
Spelling is incorrect. You are ignoring the growth analyses for word reading
and for phonological processing. Surely you are aware of all the controversy
over the emphasis on statistical significance and you seem to ignore the information
that we provided on effect sizes. Statements such as "60% of the first and
second grade populations scores was higher than these children" are not correct.
It is impossible to make inferences like this from an average percentile rank.
The 90 untutored children cannot be used as a "control" group because of the possibility
of school effects and the way in which decisions about tutoring were made by
the District and by the teachers.
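To illustrate the point about effect sizes (we use the generic standardized mean
difference as an example; it is not necessarily the exact index reported in the
paper):

    d = \frac{\bar{X}_1 - \bar{X}_2}{s_p}, \quad
    s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}

An effect size of this kind expresses the difference between two groups in pooled
standard deviation units; unlike a p value, it conveys the magnitude of a difference
and does not shrink or grow simply as a function of sample size.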
The notion of "a variety of explanations for the one significant result"
is not correct and is not supported by the other findings in the study. First,
there were no differences in efficacy or implementation ratings for either teachers
or tutors across the instructional groups. There was no evidence of any interaction
of tutoring and Open Court. Second, there is no evidence of interactions between
ethnicity or any other sociodemographic variable and instructional group. There
was evidence for individual differences on outcomes, but not for impacts of
these types of individual differences on instruction. Third, we provided the
bulk of the training and monitoring in all the research conditions. The amount
of training provided by Open Court was in fact quite limited. The people directing
the embedded phonics and whole language interventions were in fact experts in
these methods and are widely acknowledged teacher trainers. We did have "a whole
language expert" supervising the whole language intervention. Fourth, there
were no differences in initial reading scores; not only were the differences not
statistically significant, but the mean differences were not even practically
significant on the initial assessment. We do have "pre-test data", just not on the
norm-referenced tests, which have limited sensitivity to change over a nine-month
period. Fifth,
eligibility for a Federal lunch program does not mean that the child lives in "extreme
poverty." Title I eligibility is determined at the school level for the lunch program
and at the child level for literacy. Moreover, the fact that the children are fed
at school should prevent them from being "hungry during lessons or testing".
Instructional groups did not differ on this variable. Sixth, many of the second
graders could not even read
when the intervention was done. This occurred across instructional groups, including
Open Court. We don't report the standardized test results separately because
they are not different for first and second graders.
The notion that we made no "attempt to reflect on the problems of their
design" is inconsistent with the Discussion section of our paper. In this section,
we acknowledged nine different limitations of the study. We note again that
random allocation of subjects to groups is not a requirement of the statistics that
we or anyone else use, nor is it synonymous with proper research design. Quasi-experimental
studies are commonly used to make policy decisions. Do you smoke? Has anyone
been randomly assigned to smoke or not smoke, or to drink or not drink? Your
comments about tutoring programs are taken out of context. We clearly intended
to express concerns about the use of volunteer tutors, and these comments have
nothing to do with "private reading specialists". In fact, we would not criticize
"private reading specialists" for what they do. Rather, we would criticize public
schools for not providing interventions that are consistent with what is happening
in a tutoring program. If you read our other papers, you would note that we
have praised tutorial interventions, such as those done by Torgesen, Vellutino,
and Slavin. We are very aware of the research done on "whole classroom" instruction,
and the Introduction to our paper discussed extensively the work by Slavin, Engelmann,
and others. Finally, we don't think our study has set back scientific research.
Rather, we hope that it has accomplished what individuals like Stanovich and
Raudenbush have indicated, which is that the study has raised the standards
for scientific investigations of reading research because of its methodological
sophistication.
We disagree that we are irresponsible and note that the attempt to simply
denigrate research based on inadequate understanding (or whatever your motive
was) is truly irresponsible. It is a disservice to publish inaccurate descriptions
of other individuals' research and to attribute motives to individuals that
don't exist. If you have such strong concerns, why don't you write them up as
an article and send them to the Journal of Educational Psychology? In its present
form, it won't be published because it is inaccurate and naive.
Sincerely,
Jack M. Fletcher, Ph.D.
Barbara R. Foorman, Ph.D.
Chris Schatschneider, Ph.D.
David Francis, Ph.D.