We appreciate Denny Taylor’s attempt to review our study. In making this response, we would like to note that the only communication about the Foorman study to which we did not provide a detailed response was her original post to us. The assertion that we have refused to answer questions about the study is not true. To reiterate, the reason that we did not respond to that post was that many of the questions were addressed in the manuscript that just appeared in the Journal of Educational Psychology this month. Responding to the original post would have been extremely time-consuming, and we did not feel that a response would be productive since the level of detail requested was provided in the manuscript. This problem remains, but the present set of questions is easily answered, so we will respond.
We would also like to note that we deliberately chose not to release the manuscript to anyone prior to its publication. There are apparently several copies of the paper in circulation, but they were not obtained from us. The reason we did not release the paper prior to publication was not to withhold data or information, but to prevent inappropriate uses of the findings in this study and to observe APA guidelines.
The notion that we are reneging on promises to release the actual data is incorrect. We have not retracted our offer to ship the data and manuscript to both Dr. Goodman and Dr. Taylor, although the paper should be in their libraries by now. We are currently preparing to make the dataset used in the publication available via the internet. However, we have no obligation to release unpublished data and have every right to ensure that we control the publication of our own data. We welcome thoughtful re-analyses of the published data and will continue to respond to questions concerning the study.
Not releasing this paper and others prior to publication reflects a more general effort to support sharper interpretations of the data in these studies. For example, we have written several letters to newspapers indicating that the Houston study did not support phonics over whole language, but represented a test of an instructional hypothesis involving the explicitness of teaching of the alphabetic principle to children at risk for reading failure. While correcting misinterpretations of our conclusions, we have repeatedly requested citations to empirical research that supports a different conclusion, so that we could consider possible explanations for different findings and attempt to design a study to evaluate those explanations. Such references have been hard to come by, and we have seen little data that refute the conclusions we have drawn from the existing research, which makes us wonder why those conclusions have been so persistently attacked. As with the op-ed pieces, we have frequently attempted to correct Dr. Goodman’s misinterpretations of the study, and those of others, and now will address Dr. Taylor’s misinterpretations of the study.
The research design and execution was biased in favor of the explicit instruction/Open Court treatment group.
To reiterate previous responses, the study was not biased in favor of any particular group, nor is there any relationship between our group and the current or previous owners of Open Court. In fact, we approached Open Court at the recommendation of the school district in which we were conducting the research. We proposed to use DISTAR, but it was not acceptable to the District. District officials suggested Open Court because the District used Open Court Math. When we approached Open Court, which was not at the time owned by SRA/McGraw-Hill, we learned that there was a new edition of Open Court Reading. We did not have sufficient funds to purchase Open Court because it was not budgeted in the original proposal. Open Court generously provided us with the pre-publication version of the curriculum and some assistance with implementation, but teachers who used Open Court did not receive more intensive training. We have no idea whether SRA/McGraw-Hill profited from their purchase of Open Court; the sale/purchase of Open Court has no bearing on our research and is of no concern or relevance to us. Texas has not had a state adoption in reading since before our study was conducted, so the statement that Open Court has been adopted in Texas is erroneous. Frankly, we hope the original owners of Open Court have profited from its sale to SRA/McGraw-Hill because the authors and the original owners had the courage to subject the curriculum to an empirical study over which they had no control. Hopefully, other companies will follow suit. We believe that curricula lacking external evaluation will not do well in Texas when we have our next adoption.
There is considerable evidence that some of the key results of the study were misrepresented in favor of a Direct Instruction/Open Court treatment group.
The example given concerns the unseen comparison group, which purportedly had higher scores on the Formal Reading Inventory than the children in the "Direct Instruction/Open Court" treatment group. As we reported in the paper and have stated on more than one occasion on this list, there were no significant differences among any of the instructional groups on the Formal Reading Inventory. Sampling variability is the most reasonable explanation for the slight, nonsignificant mean differences on the FRI. In fact, this test had floor effects and none of the groups did particularly well on this measure. The range of mean scores across the four groups was 80.8 to 83.1, hardly a practical, much less a statistically significant, difference. In contrast, the group that received explicit instruction in the alphabetic principle (Open Court) had significantly higher scores on the cloze-based Passage Comprehension subtest of the Woodcock-Johnson-Revised. Keep in mind that the FRI was not administered to students who had fewer than 5 raw score points on the WJ-R Passage Comprehension subtest, because children scoring that low on the WJ-R are unable to do the FRI. This involved 9% of the Open Court group and 21-24% of the students in the other groups. Thus, a much larger percentage of the explicit instruction group was evaluated using the FRI. Dr. Taylor has clearly misinterpreted the Formal Reading Inventory data.
The samples were biased in favor of the Direct Instruction/Open Court treatment group.
This criticism focuses on the District Standard comparison group and the fact that this group appears to be different on several background variables from the remaining three groups. It is very important to recognize the role of this group in the study. This group is in the study only as a comparison for the research-based implicit instruction group. We make no comparisons between the district standard implicit instruction group and the explicit instruction or embedded instruction groups; those two groups are compared only to the research-based implicit instruction group. Thus, the district standard implicit instruction group has played no role in our comparisons of, or conclusions about, the differences between explicit and implicit instruction groups. These comparisons and conclusions exclusively involve the research-trained groups. The district standard implicit instruction group was included in the study as a control for the training of teachers. Our interest in including this group was to demonstrate that training of teachers makes a difference in pupil outcomes in the implicit instruction curriculum, and to demonstrate that students with research-trained implicit instruction teachers were doing at least as well as, if not better than, students with teachers supplying the district standard curriculum. In fact, there was a clear tendency for children in the research-trained implicit instruction classrooms to do better than children in the district standard classrooms. Keeping in mind the complete irrelevance of this latter group to the conclusions regarding explicit vs. implicit instruction, we have clearly acknowledged that the comparisons involving the unseen comparison group may reflect a school effect. Not only is this school somewhat different from the other schools sociodemographically, it was widely regarded as a “tough school” by District officials. The paper documents other evidence showing that the unseen comparison group sample is different (i.e., a school effect). However, the purpose of this comparison group was to assess the effects of training, as stated above. The demographic differences between this group and the research-trained implicit instruction group have made it difficult for us to make any strong statements about the role that training of teachers played in the outcomes of the study, but these differences cannot explain the results comparing direct instruction and research-based implicit instruction, which are the basis for our conclusions.
Dr. Taylor’s assertions concerning the basis for tutorial services are erroneous, but fully understandable. Within the District, children were given the emergent literacy survey and then afforded access to tutorial services based on (1) their score on the emergent literacy survey, with the lowest-scoring children being served first, and (2) the availability of funds for providing those services. Dr. Taylor is not in a position to know that tutorial services were determined by the availability of funding, not simply by scores on the District’s emergent literacy survey. There were more children in Open Court schools who did not receive tutoring than in the other interventions, but this was not because the children were not eligible for services on the basis of their emergent literacy scores. In fact, some children in the explicit instruction group did not receive tutorial services because the school in which they were enrolled had elected not to allocate funds for tutorial services for children in those grades. That is, children in those classrooms were not provided tutorial services because of a District decision on the allocation of Title 1 services, not because their emergent literacy survey scores were better than those in other classrooms.
The reviewers of the paper just published in the Journal of Educational Psychology were very concerned about the tutorial effects. At their recommendation, children who did not receive tutoring were dropped from the analyses that we report in the published study. Despite the reduction in power, there were no major differences in the overall pattern of results or the conclusions about curriculum effects when the analysis was restricted to children receiving tutorial services. We have no idea where Dr. Taylor drew the comparisons of tutored and non-tutored groups, but the issue is irrelevant to our conclusions about curriculum effects because the effects were seen in children who obtained tutoring, who represent the bottom 18% of this Title 1 population. If school effects or bias in the samples were operating to influence the outcomes, then why weren’t there differences in reading skills at baseline, and why do we not see differences in vocabulary?
Additional accelerated instruction was provided only to the Direct Instruction/Open Court group.
When the study began, it was immediately apparent that many of the second graders in the study were unable to read. Because Open Court is a basal reading program with organized lesson plans, children in the second grade began in the Grade 1 series and received two Grade 1 lessons in the language arts block. In the other programs, teachers simply reduced the level of the curriculum to the child’s level of reading ability. These other programs were not scripted. The extra lessons were only done with second graders. However, there was no difference in the amount of time spent on language arts instruction across the four groups. In addition, it is important to recognize that the intervention effects are observed even if only the first grade data are analyzed. Doubling up on lessons for second graders was simply a modification of the Open Court lesson plan to accommodate the extremely deficient reading skills of the second grade children. But again, the effects are seen, and are statistically significant, in the first grade children when analyzed alone, making it impossible for this aspect of the study to explain the curriculum group differences.
The numerous defects and resulting statistical uncertainties make any conclusions in favor of Direct Instruction/Open Court nothing more than complicated guesses based on the biases inherent in the research.
The statistical modeling of individual growth rates, the statistical methods, the statistical assumptions, and the statistical analyses are unverifiable, false, or were inappropriate, as well as simplistic and biased.
We regret Dr. Taylor’s failure to appreciate the statistical approaches used in this study. As we noted once before, the intervention effects that were observed are apparent if the data are simply graphed and examined visually. The purpose of the statistical analysis is to estimate the size of the effects, their variability, and the correlates of this variability. The approach that was used involves individual growth curve analysis, along with more traditional statistics. These methods are championed because of their ability to examine individual growth and development as well as to identify characteristics of individuals and groups of individuals that relate to individual differences in growth. In addition, these methods permit analysis of nested designs in which children can be nested within classrooms and within schools. This is a major contrast with previous intervention and curriculum studies because of the possibility of modeling change at an individual level and the capacity for identifying different influences on individual change. This is one major reason why these types of studies are significantly different from the types of curriculum studies done in the 1960s.
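For readers who would like a concrete picture of what a growth model of this general kind looks like, the sketch below is our own illustrative code, not the study’s analysis scripts. It fits a simple mixed model with a random intercept and slope for each child and only one level of nesting, and the file and column names ("growth_data.csv", "score", "time", "curriculum", "child_id") are hypothetical placeholders:

```python
# Illustrative sketch only: a simple individual growth model with one level of
# nesting (repeated assessments within children), fit with statsmodels.
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per child per assessment wave.
data = pd.read_csv("growth_data.csv")

model = smf.mixedlm(
    "score ~ time * curriculum",   # fixed part: average trajectory differs by curriculum
    data,
    groups="child_id",             # observations are nested within children
    re_formula="~time",            # each child gets their own intercept and slope
)
result = model.fit()
print(result.summary())
```

A fuller model of the sort described above would also allow classroom- and school-level variation; this sketch is only meant to show the basic idea of modeling change at the individual level.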
Many of Dr. Taylor’s assertions about the analyses are incorrect. There is no required assumption that the samples are randomly drawn. Random sampling is necessary for us to generalize our findings to the target population. The target population in this study is the population of children in the country served by Title 1. The accessible population was the children eligible for Title 1 services in several schools in the district where we were working. We did not sample these children; we took all of the children in those schools who were eligible for services. Thus, the issue comes down to whether or not the Title 1 children in these schools and this school district are representative of the Title 1 children in the rest of the country. We can only speculate on this representativeness based on the demographic information that we have collected on the children and their families. That is one reason we believe strongly in the need for replication. But whether children are sampled randomly from the population or not has no bearing on the issue of comparing the curriculum groups. The validity of this comparison is an issue of internal validity, and the key to internal validity is random assignment, not random selection.
Random assignment is an important safeguard of internal validity. As in most studies that take place in real life situations, we were unable to randomly assign teachers to curricula and we were unable to randomly assign children to teachers. Administrators, teachers, parents, and children will not be surprised that we were unable to do so. This situation is not atypical in large scale investigations in education, psychology, and other social sciences. Despite its typicality, lack of random assignment is potentially a serious concern that necessitates a thorough analysis of the data and consideration of other background factors that could account for the differences between groups other than the curriculum assignment. Lack of random assignment complicates the analysis and interpretation of the data, which is just one more reason why we believe "replicability" and replication are so important to educational research.
There are many situations where random assignment is not possible. Most people believe that smoking causes cancer in humans, but there is not a single study involving humans in which subjects were randomly assigned to be smokers or non-smokers to support this claim. Nevertheless, over the years, a substantial body of evidence has accumulated from quasi-experimental research that supports this conclusion, which is now widely accepted. In the case of the current study, random assignment of teachers to curricula would have raised a different set of concerns. Some might complain that if teachers were randomly assigned to a curriculum, they could not be as effective in delivering that curriculum because they (1) wouldn’t know it as well, or (2) might not believe in it, etc. They might complain further that such assignment differentially impacts the delivery of an implicit instruction curriculum. That may well be true. As this example makes clear, even random assignment of teachers to curricula would not solve all problems; it would trade off some concerns for others. Our study is not a “true experiment” in the sense that random assignment was not used, but we believe the study represents good quasi-experimental research, involving multiple schools, teachers, children, time points, and outcomes. We have tried to measure the relevant subject and teacher characteristics that could potentially explain differences between curricula that are, in fact, not attributable to curricula. Even when these factors are controlled in the analysis, differences among curricula remain. Nevertheless, only replication will help us to know the extent to which we have been successful in resolving these issues in the current study.
Having said that, we feel it is also important to provide more explanation about the analysis process. Dr. Taylor is correct in stating that the analysis, like any statistical analysis, involves assumptions. In fact, it is incumbent on us to evaluate the extent to which those assumptions seem plausible, and to assess, if possible, their impact on the conclusions that we have drawn. In that regard, the sample distributions are studied extensively. Prior to any form of analysis, the data are plotted and any problem with lack of normality is examined. However, the methods do not assume a normal distribution for the observed data. Rather, they assume a normal distribution for the growth parameters and for the residuals. We routinely examine these assumptions by examining the distribution of residuals and growth parameters. The notion that the measures used in the growth analyses must be interval-based is not correct. This is desirable, but not required. Interval measurement is a necessary assumption for us to place a substantive interpretation on the growth parameters, i.e., to assume that we have captured the true shape of the growth trajectory. However, it is not a necessary assumption for us to model the observed growth trajectories and to compare growth parameters across curriculum groups and children. Nevertheless, we believe that the measures exhibit properties that support interval-based interpretation. For example, the phonemic awareness measure comprised 105 items that were scored as correct or incorrect. The word reading list comprised 50 items that were scored correct or incorrect. Teachers often give students assessments where they score the items and then average the scores from different assessments in the same subject area. Those assessments are typically made up of items that are awarded points depending on the answer. Teachers might add up the points across the items to compute a total. When teachers add items to get a total score, or average scores from different assessments, they are treating those assessments as though they have interval properties, even though they may not possess such properties in a strictly technical sense. This is an interesting psychometric point, but probably not one that teachers worry much about, nor should they as they are trying to decide what evaluation to assign to the different children in their class. What we have done is really not much different from averaging the scores on these tests with large numbers of items scored correct and incorrect. We fit a line to the scores for each child. These lines may vary considerably in their slope, and may be linear or nonlinear. The slope of the line tells us the rate of the child’s improvement on the skill, and the intercept of the line (with time centered at the middle of the measurement period) tells us the child’s average level across the time points over which change was measured. If the line is flat, the intercept is just the average of the points. Statistically, the process is a little more complicated than that because we want to estimate the line as precisely as possible given the data that we have on the child and all that we know about the child, but it’s really not much different.
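To make the line-fitting idea concrete, here is a small illustration of fitting a line to one child’s repeated scores. The numbers are invented for the example and are not data from the study; the point is simply that the slope summarizes the rate of improvement and, with time centered, the intercept equals the child’s average level:

```python
# Fit a straight line to one (invented) child's repeated scores.
import numpy as np

times  = np.array([0.0, 1.0, 2.0, 3.0])   # assessment waves during the year
scores = np.array([12., 18., 25., 33.])   # e.g., words read correctly at each wave

centered = times - times.mean()            # center time at the middle of the period
slope, intercept = np.polyfit(centered, scores, deg=1)

print(f"rate of improvement per wave: {slope:.1f}")
print(f"average level (intercept at centered time): {intercept:.1f}")
```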
Dr. Taylor asserts that the methods we used were biased in favor of the explicit instruction group, but this is not true. The methods we used, hierarchical linear models, estimate the slope and intercept for an individual as a weighted average of two estimates, one based only on the data from the child and one based on the average data for all similar children (i.e., children in the same curriculum, same age, gender, etc.). The weights are related to the quality of the information. The more data we have on the child, and the more precise the measurement of the child, the more weight the individual data receive and the less weight is given to the group-level data. Thus, at the individual level, a child’s estimated growth parameters are “biased” in the statistical sense that the long-run average of the estimates is not the child’s true parameter. Rather, the child’s estimate has been “shrunken” back toward the mean of that child’s group. It can be shown statistically that this shrunken estimate is actually optimal in the sense that it is closer to the child’s true value in the long run, even though it is not “unbiased.” The same kind of shrinkage takes place at the teacher level. In essence, the statistical models that we use actually work to pull everybody back toward the middle, i.e., they work to reduce differences. Thus, groups that perform below average are pulled up toward the grand mean, and groups that perform above average are pulled down toward the mean. The further a group lies from the mean and the less precise the group-level data, the more the group is shrunken back toward the mean. Thus, it is hard for us to see how this aspect of the statistical modeling could account for the differences among instructional groups because it operates to pull all groups toward the overall mean. But perhaps Dr. Taylor was referring to some kind of bias of which we are not aware.
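The “shrinkage” described above can be illustrated with a deliberately simplified sketch. This is not the study’s actual model; it assumes a basic random-effects setup in which a child’s slope estimate is a precision-weighted average of the child’s own estimate and the mean of the child’s group, with the numbers invented for illustration:

```python
# Simplified illustration of empirical-Bayes shrinkage toward a group mean.
def shrunken_estimate(child_slope, group_mean_slope, tau2, sigma2_child):
    """tau2: between-child variance of true slopes;
    sigma2_child: sampling variance of this child's own slope estimate."""
    weight = tau2 / (tau2 + sigma2_child)   # reliability of the child's own estimate
    return weight * child_slope + (1.0 - weight) * group_mean_slope

# A child measured imprecisely is pulled strongly toward the group mean;
# a precisely measured child keeps an estimate close to their own data.
print(shrunken_estimate(child_slope=9.0, group_mean_slope=5.0, tau2=1.0, sigma2_child=4.0))
print(shrunken_estimate(child_slope=9.0, group_mean_slope=5.0, tau2=1.0, sigma2_child=0.1))
```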
Why do we think we’re justified in saying that the tests behave like they have interval properties? Since this paper was submitted, an item response model has been applied to the item level data of the primary outcome measures, which puts the data on an interval scale. We have no evidence that modeling raw scores in the published paper biased the results for these primary outcomes in any way.
We also note that some of the outcome measures are scores from standardized tests, but why these would be labeled "nominal or ordinal" is not clear to us. In fact, the Woodcock-Johnson tests that we used are based on the Rasch IRT model, which yields interval-based scores.
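For context, the Rasch model mentioned above expresses the probability of a correct response as a function of the difference between a person’s ability and the item’s difficulty, both placed on the same logit (interval) scale. The values in this small sketch are made up purely for illustration:

```python
# Minimal illustration of the Rasch model: P(correct) depends only on the
# difference between ability and item difficulty, on a common logit scale.
import math

def rasch_p_correct(ability, difficulty):
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

print(rasch_p_correct(ability=0.5, difficulty=-0.5))   # easier item: higher probability
print(rasch_p_correct(ability=0.5, difficulty=1.5))    # harder item: lower probability
```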
Dr. Taylor also asserted that we assumed that change was linear and that we could summarize the data from a complex collection of individuals into a single line. In fact, there was no assumption that change was linear. We tested a variety of models of change, including "straight line" and curvilinear models. In most instances, the model that best characterizes the growth data is quadratic because there is acceleration of growth (in the case of word reading) or deceleration of growth (in the case of phonemic awareness) in the latter part of the school year. There is no assumption of linearity. In addition, there is no attempt to "discard the individual data for each child." The heart of the method is an analysis of change at an individual level. The fact that the lines have different slopes and intercepts, and possibly different curvatures (although we did not find this in our analyses), does not prevent us from estimating the average slope or intercept. In fact, the method that we used to analyze these data is the best method available for making sense of longitudinal data in developmental contexts, such as this one. We refer Dr. Taylor to writings by John Willett, David Rogosa, and Steve Raudenbush on this particular issue. Interestingly enough, if change were linear at the individual level (i.e., characterized by a straight line for each individual, albeit with possibly different slopes and intercepts), the average of the individual lines is, in fact, the line through the averages. That is, the linear model is said to have the property of dynamic consistency, a fact of which Dr. Taylor seems unaware.
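The dynamic-consistency point is easy to verify numerically. The short sketch below, with invented trajectories, checks that when every individual trajectory is a straight line (with varying slopes and intercepts), the average of the individual lines coincides exactly with the line through the time-point averages:

```python
# Numerical check of dynamic consistency for linear individual trajectories.
import numpy as np

rng = np.random.default_rng(0)
times = np.array([0.0, 1.0, 2.0, 3.0])
intercepts = rng.normal(10.0, 2.0, size=50)     # one invented line per "child"
slopes     = rng.normal(3.0, 1.0, size=50)

trajectories = intercepts[:, None] + slopes[:, None] * times[None, :]

mean_curve    = trajectories.mean(axis=0)                   # average the individual lines
line_of_means = intercepts.mean() + slopes.mean() * times   # line through the averages

print(np.allclose(mean_curve, line_of_means))               # True
```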
Dr. Taylor further asserts that the statistical methods are naïve and unsupportable. The methods analyze change at an individual level, evaluate sources of variability that correlate with this individual change, and can accommodate linear, nonlinear, and other forms of growth. We are confident that the approach we have taken affords the most complete and comprehensive analysis of these data. The general approach that we have taken is the only approach (i.e., class of statistical models) available that allows us to deal simultaneously with the fact that time points are nested within students and students within teachers. Furthermore, we believe in the safeguards afforded by peer review. We have confidence that the Editorial Board of the Journal of Educational Psychology ensured that our manuscript received fair and competent review, including competent statistical review. With respect to the analysis of longitudinal, educational data, Dr. Taylor’s views are not consistent with the mainstream of educational research. Her characterization of our methods ignores major advances in statistics that have occurred in the last 20 years and fails to consider the methodological contributions that we have made not only in the area of reading research, but in other areas of psychological research.
Conclusion
We do not assume that "training children to read words and pseudowords will enable them to read cohesive text." What we do claim is that children who are unable to read words and pseudowords will not be able to read text at age level. We also claim - like many others - that the proximal cause of reading failure most commonly occurs at the level of the single word. The results of our study clearly demonstrate that this is indeed the case. Like all quasi-experiments, our study is not without limitations. However, those limitations are not the biased and subjective criticisms put forward by Drs. Taylor, Goodman, and Coles.
The real limitations of the study are clearly outlined in the Discussion, as follows:
Subjects were not randomly assigned to treatment groups. This makes the study quasi-experimental, as opposed to a true experimental study. The lack of random assignment, however, is not a fatal flaw. The issue is how well we have measured other potential explanations of the findings, which is why we go to great lengths to identify potential school effects, monitor teacher compliance, assess multiple outcomes, etc. Quasi-experimental designs are standard in most forms of intervention research, particularly in areas where it is not possible to randomly assign children to schools, much less to treatments. Hence, a limitation is always the possibility that some variable that would explain the outcomes was not measured. Neither our group nor the reviewers were able to identify such variables, but that is always a distinct possibility, which is why we advocate so strongly for independent systematic replication.
An alternative interpretation of the results of the study is that they simply reflect the effects of including a basal reading program. In other words, it may not be the components of Open Court that explain the results, such as the explicit teaching of the alphabetic principle, emphasis on applications in real literature, inclusion of real literature, etc. It is possible that the inclusion of any form of a basal reading program would have produced the sorts of results that we obtained. This is a possibility that we are presently investigating. We do not think this is an adequate explanation of the results. We believe that the differential outcomes reflect the variations in the components of the programs, but that is why more research is needed. Beliefs are not sufficient, and need to be treated as hypotheses that are potentially disconfirmable.
The study does not show that phonics works better than whole language. In fact, this conclusion is not possible because we did not have a phonics-only condition, and teachers taught phonics in all curricula. The issue that was addressed was how explicitly this instruction needed to be provided. In this study, like others involving high-risk populations, explicit instruction was necessary, apparently to impact phonological awareness skills. However, Open Court is a balanced approach to reading instruction. This phonics-only interpretation has been promoted by the media and by extreme interpretations by pro-phonics groups as well as by Dr. Goodman and Dr. Coles.
The problems with the “unseen comparison” school are clearly described. We did not monitor classroom reading instruction and cannot determine the extent to which the behavioral and academic problems in these classrooms were the consequences of instruction. However, these concerns do not apply to the three research conditions, to the comparisons of explicit and implicit instruction, or to the conclusions reached about the relative superiority of explicit instruction in the study.
The differences on reading comprehension measures were not as robust as the differences in word recognition skills. This does not mean that instructional group differences were not found in reading comprehension. In fact, effect sizes were large and favored the group that received explicit instruction in the alphabetic principle. However, they were not as robust as the word recognition differences and we have considered how programs could be modified to enhance the outcomes in comprehension. It is interesting to speculate whether a given modification would have the same impact irrespective of the curriculum.
The schools were not able to maintain the proposed student-teacher ratios in the tutorial groups, nor obtain tutoring for all children. This issue was addressed by focusing only on children who received tutoring. There was no evidence in any of the analyses for any effects of tutoring, but this component of the study was weak.
Longer-term follow-up of these children is clearly indicated in order to determine whether the gains in reading skills continue to accelerate in the explicit instruction group and the extent to which these represent long-term effects on other aspects of the reading process. There was evidence, for example, that the group that received implicit instruction had more positive attitudes towards reading. These positive attitudes could lead to better long-term gains, although these positive attitudes are not likely to be maintained if the children don’t learn how to read print. It is possible that varying the sequence of instructional methods would make a difference. The influence of subsequent instruction in alphabetic and orthographic rules in grade 2 could potentially ameliorate the lack of growth observed in grade 1. It may be that the effects of the embedded code instruction will require a longer time to generalize to word and text reading. We are presently investigating this possibility. To summarize, it is entirely possible that the gains observed in the first year would not be persistent, or that the observed efficacy of some of the interventions will attenuate.
The positive effects of explicit instruction did not generalize to all achievement areas, particularly spelling. There were no significant instructional group differences in spelling outcomes.
All instruction took place in a print-rich environment with a significant literature base. Although there were distinct instructional activities for alphabetic work and for literature in the explicit instruction condition, both kinds of activities were provided.
Many of the questions raised by Dr. Taylor are addressed in the manuscript just published in the Journal of Educational Psychology (March, 1998 issue). Weaknesses of the study are described and explicitly addressed. There is clearly a need for more research, which we have initiated. This is only one study, although it is consistent with a large body of research on reading skills development in normal children and intervention research in high risk children. We welcome opportunities to discuss the results of this study and particularly encourage attempts to replicate. If other approaches to analysis show different results, we would like to discuss these differences. We will encourage others to submit to peer review and look forward to the results of further scrutiny.
Barbara Foorman
David Francis
Jack Fletcher
Chris Schatschneider