June 19, 1998
Diane McGuinness, Ph.D.
University of South Florida -
St. Petersburg
140 7th Avenue South
St. Petersburg, FL 33701
Dear Dr. McGuinness:
We were recently bombarded by faxed copies of a review of our study published
in the Journal of Educational Psychology this past March, which you apparently completed
for Parenteacher magazine. We are writing to express our concern about the tenor
and inaccuracy of your review. We were particularly concerned because many of
the issues that you raised have been addressed in other venues by our group,
some of which you have clearly read. We don't understand the need to distort
and misrepresent our research, much less the need to denigrate research that
your review suggests you don't really understand. We are also concerned because
of the misunderstanding of NICHD research embedded in your review. We will address
some of our concerns on a point-by-point basis.
In the Introduction, you implied that there is a difference between what
we report at conferences and during testimony and the actual published report.
This is not the case. For example, we have never claimed that our study pits
phonics against whole language. We have always noted that the instructional
principle of primary interest is the explicitness of instruction in the alphabetic
principle necessary to facilitate early reading skills in high risk populations.
In the 1997 paper, the results are clearly described as "preliminary findings."
Your depiction of large-scale and small-group studies is puzzling. As we're
sure you are aware, the NICHD has funded research on reading skills since 1965.
The size of the studies varies. However, your notion that the "small group model"
is tied to certain types of statistical methods is incorrect. For example, Fisher
developed the analysis of variance as a shortcut because of the amount of time
required to compute discriminant functions prior to the advent of computers.
He viewed it as a limited method. There is no strong relationship between certain
types of statistical methods and variations in experimental design. Because
of limitations of these types of models, some of which stem from the absence
of high-speed computers, it was often necessary to design studies that would
correspond to the limitations of existing statistical models, such as ANOVA.
The notion that "random assignment of subjects to conditions ... is essential"
is not correctly attributed to Fisher and is not a requirement for any parametric
statistic. For example, ANOVA has only two primary assumptions. The first is
normality, an assumption to which ANOVA is usually robust. The second is independence
of observations, to which ANOVA is not robust. There is no requirement for random
assignment as far as statistics are concerned. Certainly, as Campbell and Stanley
(1963) outlined, random assignment has considerable influence on the strength
of inferences that can be made from different kinds of experimental designs.
However, even when random assignment is not possible, as is common in most
instructional research, inferences can still be made from strong quasi-experimental
studies. Since you read our explanation of this point in our response to Denny
Taylor, it is surprising that you bring up this issue as a problem. Moreover,
it is astonishing that you would suggest that random assignment is required
for parametric statistics. Your notion of how a "small group model proceeds"
is arbitrary and artificial, and is followed in no systematic way by any reading
researcher whom we know.
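To make the statistical point concrete, consider the following minimal sketch
(written in modern software with made-up scores, not data from our study): a
one-way ANOVA computed on intact, non-randomly assigned classrooms. The F
statistic can be computed perfectly well; random assignment bears on the causal
interpretation of the result, not on whether the statistic applies.

    # One-way ANOVA on intact (non-randomly assigned) groups; hypothetical scores.
    # The test assumes approximate normality and independence of observations;
    # it does not require that the groups were formed by random assignment.
    from scipy import stats

    classroom_a = [12, 15, 14, 10, 13, 16, 11, 14]
    classroom_b = [9, 11, 10, 12, 8, 10, 13, 9]
    classroom_c = [14, 17, 15, 13, 16, 18, 12, 15]

    f_stat, p_value = stats.f_oneway(classroom_a, classroom_b, classroom_c)
    print(f"F = {f_stat:.2f}, p = {p_value:.4f}")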
Benita Blachman is our colleague. We have worked closely with her for several
years. You may be interested to know that in her current intervention studies,
we are providing some of the methodological expertise. Her current studies
are designed to use multi-level models of the type commonly recommended for
instructional research, so that individual change models can be used. Individual
change models are important because they avoid many of the problems of traditional
pre-post designs and statistics such as ANOVA, which do not permit analyses of
individual change and are hampered by the widely acknowledged problems in evaluating
difference scores.
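For concreteness, a minimal individual change (growth) model of the general kind
we have in mind can be sketched as follows (the notation is illustrative only and
is not taken from any particular paper):

    y_{ti} = \pi_{0i} + \pi_{1i} t_{ti} + e_{ti}
    \pi_{0i} = \beta_{00} + u_{0i}
    \pi_{1i} = \beta_{10} + u_{1i}

Here y_{ti} is child i's score at occasion t, \pi_{0i} and \pi_{1i} are that child's
own intercept (initial status) and slope (growth rate), and the second-level equations
allow those parameters to vary across children. Because each child's growth is estimated
from repeated measurements, the model avoids relying on a single, notoriously unreliable
post-minus-pre difference score.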
NICHD has no particular mandate "to look at every variable that might impact
on reading, and to use large numbers of children in each study …" Most of the
NICHD research is investigator initiated. It receives an independent review
by a group of peers on a study section. Program is independent of review. Program
cannot fund studies that don't pass muster under peer review. Any grant funded
must be at a certain level of quality based on comparisons across the entire
NIH. This is the same model used to fund all biomedical research in the National
Institutes of Health. Although Program might recommend "large scale studies",
the reality is that the NICHD program directed by Reid Lyon funds research at
all levels, involving many different types of designs and many different sample
sizes. The notion that "scientists" would "be doing the testing and data entry
themselves" is not correct. This actually depends more on whether the investigator
has funding. If they have funding, you can be assured that they will typically
hire research assistants to do data collection and data entry. This is a technical
skill that does not require a "scientist."
Your notions about the mismatch between the number of students and the number
of variables, the absence of an explicit experimental design, and the use of the
wrong statistical tests are simply incorrect. Moreover, your depiction of the purpose
of the study is incorrect, as is your description of the program. Open Court's
(1995) Collections for Young Scholars is not a phonics program. It is a balanced
approach that includes phonemic awareness, phonics, and a significant emphasis
on literature.
We did not conceptualize our study in terms of a traditional analysis of variance
model. In fact, as we explicitly described in our published paper, this study
was conceptualized as a multi-level design in which children were nested within
classrooms within schools. There is a substantial literature in the social sciences
on the importance of multi-level models for estimating sources of variability
attributable to different components of a particular educational practice or
intervention. Multi-level models were developed in part because of the limitations
of models based on analysis of variance designs. Your comments clearly indicate
that you do not understand this area of educational research despite the extensive
literature in this area. Hence, your notion of the experimental design
is simply incorrect. There are indeed four methods in two grades. There are
three levels of ethnicity represented, but this was never conceptualized as
a between-subjects factor. Moreover, all the children were Title I eligible,
so there are not two levels of poverty. You are confusing the instructional
curricula with differences within the schools. Your idea of some sort of strict
correspondence between number of subjects and factors is erroneous. Even in
an analysis of variance design, this would depend on power. All the decisions
that we made concerning the collapsing of grade 1 and grade 2 analyses, the tutorials,
etc. were based on statistical analyses demonstrating no effects of these components.
As we stated in the paper, we chose to evaluate variability due to age (a continuous
variable) and not grade. Because these components had no effect on instructional
outcomes, this is what was reported. Again, you state that "there was no random assignment
to classrooms, a requirement of the statistics employed." This is not a requirement
of statistics. It has more to do with how the results are interpreted and the
strengths of causal inferences that can be made from the design. The bottom
line is that you apparently do not understand multi-level, individual change
models, which is unfortunate.
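As a concrete sketch of what such a specification looks like in modern software
(the variable names are hypothetical, and this is not the code or the exact model
from the published paper), children nested within classrooms within schools might
be modeled with random intercepts at both the school and classroom levels:

    # Multi-level sketch: children nested within classrooms within schools.
    # Hypothetical column names; not the analysis reported in the paper.
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("scores.csv")  # columns: school, classroom, curriculum, word_reading

    model = smf.mixedlm(
        "word_reading ~ curriculum",                   # fixed effect of curriculum
        data=df,
        groups="school",                               # random intercept for each school
        vc_formula={"classroom": "0 + C(classroom)"},  # classrooms nested within schools
    )
    print(model.fit().summary())

Models of this kind partition the variance in outcomes into school, classroom, and
child components instead of forcing everything into a single between-subjects error
term.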
Many of your descriptions concerning the assignment of teachers to methods,
training, tutoring, nature of the schools, etc. are nothing more than guesses
on your part. Since you read the posts, you know that your depiction of our
relationship with Open Court is incorrect. Open Court provided several days
of in-service training for our teachers, consistent with the amount of training
in a particular curriculum that our instructional leaders provided in the other
curricular areas. At the time of the study, McGraw-Hill did not own Open Court.
Open Court was selected by the school district because the district used Open
Court Math. Materials were provided to the district free of charge on a pilot
basis in 1994-95, prior to the release of the 1995 program. We have no idea how
you could conclude "the Open Court classrooms were over twice as likely to be
in the District's lower poverty schools." All the children served were Title
I eligible because of the school-level designation and their literacy levels.
There were no differences in age, ethnicity, free lunch programs, or any sociodemographic
variable across the instructional groups. As the paper states, we controlled
for school effects by placing more than one curriculum in a school. We controlled
for tutor effects by asking tutors to deliver more than one tutorial approach
– either the district standard (Reading Empowerment – not Reading Recovery)
or Open Court or Embedded Phonics. Our question is why it is necessary to distort
the nature of the study as well as the results. As Keith Stanovich noted in
his 1997 Causey Award Address, reading professionals tend to deal with issues
by distorting the database when in fact we should all share in the richness of
the database that is common to all of us.
Your depiction of the measures is not correct. You seem to have confused
"norm-referenced" with "standardized". In fact, both our word reading tests
and the Torgesen-Wagner battery have extensive psychometric research supporting
their reliability and validity. It is absolutely correct that the Woodcock-Johnson
has an insufficient number of words to detect change. This is why the word list
is necessary. The word reading list correlates .88 with the word reading subtest
from the Woodcock-Johnson. The performance of children on the test (i.e., the
relationship of means and standard deviations) has little to do with the reliability
of the measures, but says more about the variability of the children. You can't
conclude that the reading tests have no validity based solely on student performance.
Your conclusion that the tests are "invalid" is frankly ridiculous and is
little more than an attempt to denigrate based on your own lack of understanding.
Many of your descriptions of the results are not correct. As the paper states,
we eliminated components of the phonological awareness tests because they were
highly correlated (r > .9). The results hold whether we use the analysis or synthesis
subtests. The differences on the word list were statistically significant, even
with a conservative alpha adjustment. This was apparent in the growth
curve analyses, reflecting effects on both slope and intercept. The t tests that
were conducted were used to follow up what you describe as "ANOVA" statistics,
which is not a correct depiction of what was done. We performed overall tests
and followed "significant" overall tests with the t tests. The comments about
the April scores reflect only the intercept effect, not the effect of slope.
They ignore completely the log linear analyses that were done. We did not give
pre-tests on the norm-referenced educational tests because we did not think
that such analyses were necessary, given the amount of information that we had
on change over time during the year. Surely you are aware of all the issues
with the computation of statistical significance in statistical designs based
on difference scores, pre-post designs, etc. As we reported, there were no differences
on the initial assessment, which could be treated as a pre-test if that were
the type of experimental design being employed. Your suggestion that
the only significant group comparisons involved the Woodcock-Johnson and Kaufman
Spelling is incorrect. You are ignoring the growth analyses for word reading
and for phonological processing. Surely you are aware of all the controversy
over the emphasis on statistical significance and you seem to ignore the information
that we provided on effect sizes. Statements such as "60% of the first and
second grade populations scores was higher than these children" are not correct.
It is impossible to make inferences like this from an average percentile rank.
The 90 untutored children cannot be used as a "control" group because of the possibility
of school effects and the way in which decisions about tutoring were made by
the District and by the teachers.
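To illustrate the point about effect sizes (we use the generic standardized mean
difference as an example; it is not necessarily the exact index reported in the
paper):

    d = \frac{\bar{X}_1 - \bar{X}_2}{s_p}, \quad
    s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}

An effect size of this kind expresses the difference between two groups in pooled
standard deviation units; unlike a p value, it conveys the magnitude of a difference
and does not shrink or grow simply as a function of sample size.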
The notion of "a variety of explanations for the one significant result"
is not correct and is not supported by the other findings in the study. First,
there were no differences in efficacy or implementation ratings for either teachers
or tutors across the instructional groups. There was no evidence of any interaction
of tutoring and Open Court. Second, there is no evidence of interactions between
ethnicity or any other sociodemographic variable and instructional group. There
was evidence for individual differences on outcomes, but not for impacts of
these types of individual differences on instruction. Third, we provided the
bulk of the training and monitoring in all the research conditions. The amount
of training provided by Open Court was in fact quite limited. The people directing
the embedded phonics and whole language interventions were in fact experts in
these methods and are widely acknowledged teacher trainers. We did have "a whole
language expert" supervising the whole language intervention. Fourth, there
were no differences in initial reading scores; not only were the differences not
statistically significant, but the mean differences were not even practically
significant on the initial assessment. We do have "pre-test data", just not on the
norm-referenced tests, which have limited sensitivity to change over a nine-month
period. Fifth,
eligibility for a Federal lunch program does not mean that the child lives in "extreme
poverty." Title I eligibility is determined at the school level for the lunch program
and at the child level for literacy. Moreover, the fact that the children are fed
at school should prevent them from being "hungry during lessons or testing".
Instructional groups did not differ on this variable. Sixth, many of the second
graders could not even read
when the intervention was done. This occurred across instructional groups, including
Open Court. We don't report the standardized test results separately because
they are not different for first and second graders.
The notion that we made no "attempt to reflect on the problems of their
design" is inconsistent with the Discussion section of our paper. In this section,
we acknowledged nine different limitations of the study. We note again that
random allocation of subjects to groups is not a requirement of the statistics that
we or anyone else use, nor is it synonymous with proper research design. Quasi-experimental
studies are commonly used to make policy decisions. Do you smoke? Has anyone
been randomly assigned to smoke or not smoke, or to drink or not drink? Your
comments about tutoring programs are taken out of context. We clearly intended
to express concerns about the use of volunteer tutors, and these comments have
nothing to do with "private reading specialists". In fact, we would not criticize
"private reading specialists" for what they do. Rather, we would criticize public
schools for not providing interventions that are consistent with what is happening
in a tutoring program. If you read our other papers, you would note that we
have praised tutorial interventions, such as those done by Torgesen, Vellutino,
and Slavin. We are very aware of the research done on "whole classroom" instruction,
and the Introduction to our paper discussed extensively the work by Slavin, Engelmann,
and others. Finally, we don't think our study has set back scientific research.
Rather, we hope that it has accomplished what individuals like Stanovich and
Raudenbush have indicated, which is that the study has raised the standards
for scientific investigations of reading research because of its methodological
sophistication.
We disagree that we are irresponsible and note that the attempt to simply
denigrate research based on inadequate understanding (or whatever your motive
was) is truly irresponsible. It is a disservice to publish inaccurate descriptions
of other individuals' research and to attribute motives to individuals that
don't exist. If you have such strong concerns, why don't you write them up as
an article and send them to the Journal of Educational Psychology? In its present
form, it won't be published because it is inaccurate and naive.
Sincerely,
Jack M. Fletcher, Ph.D.
Barbara R. Foorman, Ph.D.
Chris Schatschneider, Ph.D.
David Francis, Ph.D.