September 2013: Examining Learning Disability Classification Accuracy

Macmann, G. M., Barnett, D. W., Lombard, T. J., Belton-Kocher, E., & Sharpe, M. N. (1989). On the actuarial classification of children: Fundamental studies of classification agreement. The Journal of Special Education, 23(2), 127–149.

Summary by Dr. Jeremy Miciak

Historical Context of the Study

In 1976, specific learning disability (LD) was established as a disability category eligible for special education services. In 1977, federal regulations specified that LD should be identified by documenting a significant discrepancy between the student’s ability (typically indicated by an IQ test) and achievement (typically indicated by an academic subtest, like reading comprehension or math computations). However, federal regulations left it to states to define what constituted a “significant discrepancy.” Different states adopted different identification processes, which often used different formulae for calculating a discrepancy and different measures for aptitude and achievement. States also differed in the extent of their reliance on clinical judgment or test scores only.

In response, many researchers began to investigate the implications of different identification processes and the classifications they produced. One important goal was to determine whether classification decisions based on different identification processes were reliable. That is, would the same student be classified as LD by one method but not by another? This question was important because eligibility for special education services depended on the outcome.

Overview of Classification Research

Classification research investigates the processes and criteria used to identify a disorder. This work is important in many fields because successful treatment often depends on accurate classification. For some conditions, the true status of an individual is known, such that the classification criteria can be evaluated against this “gold standard.” In these circumstances, researchers can evaluate classification decisions against reality to determine the most efficient, sensitive, and specific criteria for identifying a disorder.

However, for many disorders, there is no gold standard that researchers can use to evaluate classification criteria. In these circumstances, researchers must evaluate the reliability of classification decisions and whether the resulting groups differ in some meaningful, external way (Morris & Fletcher, 1988). The essential question is whether classification decisions reflect real differences between individuals that result from the disorder or insubstantial differences produced by flaws in the classification criteria.

In classification research investigating LD, there is no such gold standard—no blood test or neurological scan that could indicate who truly has LD. Instead, researchers must investigate different classification criteria for their reliability and potential validity. The reviewed study by Macmann, Barnett, Lombard, Belton-Kocher, and Sharpe (1989) represents an archetypal example of classification research in LD, with lasting implications for the way that LD is identified in research and practice.

Study Design

Macmann et al. (1989) investigated the reliability of classification decisions based on strictly actuarial classifications of LD. Actuarial classification refers to a process by which students are classified based on test scores only.

The authors explored the following two potential sources of error in the identification process to measure their effects on classification decisions:

  1. Two distinct formulae for the calculation of a significant discrepancy: One formula was based on a simple difference of aptitude and achievement (standard score comparison), and the second was based on the difference between the student’s observed achievement score and the achievement score predicted from observed aptitude (regression-prediction method).
  2. Two distinct tests of academic achievement: Both tests measured the same domain (reading) but included entirely different items.
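
The two formulae can be sketched in a few lines of Python. This is a minimal illustration, not the authors' procedure: the standard-score scale (mean 100, SD 15), the aptitude-achievement correlation of .60, the 1.5 SD cutoff, and all function names are assumptions chosen for the example.

```python
import math

MEAN, SD = 100.0, 15.0  # assumed standard-score scale


def simple_difference(aptitude, achievement):
    """Standard score comparison: aptitude minus observed achievement."""
    return aptitude - achievement


def regression_discrepancy(aptitude, achievement, r):
    """Achievement predicted from aptitude minus observed achievement.

    r is the (assumed) correlation between aptitude and achievement.
    """
    predicted = MEAN + r * (aptitude - MEAN)
    return predicted - achievement


def is_discrepant_simple(aptitude, achievement, cutoff_sd=1.5):
    # Cutoff expressed in standard-score SD units (hypothetical value).
    return simple_difference(aptitude, achievement) >= cutoff_sd * SD


def is_discrepant_regression(aptitude, achievement, r=0.6, cutoff_sd=1.5):
    # Cutoff expressed in SD units of the regression residual.
    residual_sd = SD * math.sqrt(1.0 - r ** 2)
    return regression_discrepancy(aptitude, achievement, r) >= cutoff_sd * residual_sd


# Two hypothetical students on whom the formulae disagree.
print(is_discrepant_simple(130, 106), is_discrepant_regression(130, 106))  # True False
print(is_discrepant_simple(85, 70), is_discrepant_regression(85, 70))      # False True
```

Because the regression method predicts achievement closer to the mean than aptitude, the two formulae tend to disagree most for students whose aptitude scores are far from average.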

Two studies investigated the implications of the two sources of error. The first study used empirical data from 373 students to compare the classifications resulting from the two formulae and from the two reading tests. The second study used simulated data to illustrate the underlying psychometric principles involved. Simulated data are generated by a computer by using the observed reliability of a test and its correlation with other measures. To investigate the generalizability of the findings from the first study, 5,000 observations were generated.
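
As a rough sketch of how such data can be generated from a reliability value, the following Python code simulates two parallel forms of a test under classical test theory (observed score = true score + error). The reliability of .90, the z-score scale, the random seed, and the function names are assumptions for illustration; the summary does not describe the simulation procedure the authors actually used.

```python
import math
import random


def simulate_parallel_forms(n, reliability, seed=0):
    """Generate n pairs of observed z-scores for two parallel test forms.

    Each observed score is a shared true score plus independent error,
    scaled so that each form has unit variance and the given reliability.
    """
    rng = random.Random(seed)
    true_sd = math.sqrt(reliability)
    error_sd = math.sqrt(1.0 - reliability)
    pairs = []
    for _ in range(n):
        true_score = rng.gauss(0.0, true_sd)
        form_a = true_score + rng.gauss(0.0, error_sd)
        form_b = true_score + rng.gauss(0.0, error_sd)
        pairs.append((form_a, form_b))
    return pairs


def pearson_r(pairs):
    """Pearson correlation between the two columns of a list of pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


pairs = simulate_parallel_forms(n=5000, reliability=0.90)
print(round(pearson_r(pairs), 2))  # close to the specified reliability
```

With 5,000 observations, the sample correlation between the two forms lands very near the specified reliability, which is what makes simulated data useful for isolating the effect of measurement error.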

Study Results

Study 1

  • Correlations across formulae indicated that relative standing in the sample was very consistent, ranging from .89 to .96 (a correlation value of 1.0 indicates perfect agreement; lower values indicate less agreement).
  • When comparing the two formulae, indices of classification agreement were generally in the good to excellent range but rarely met what Macmann et al. argue would be an acceptable level of agreement for diagnostic labeling and educational segregation (kappa > .90). Observed kappa estimates ranged from .57 to .86. Kappa is an index of agreement beyond what is expected by chance. A kappa of 1.0 would indicate perfect agreement; lower values indicate less agreement.
  • The use of different achievement measures with both the standard score comparison and regression-prediction methods resulted in unacceptably low agreement in classification decisions, with kappa ranging from .23 to .47.
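
For readers unfamiliar with kappa, the index can be computed from a 2 × 2 table of classification decisions. The sketch below uses the standard Cohen's kappa formula; the function name and the counts are hypothetical and are not data from the study.

```python
def cohens_kappa(both_yes, only_a, only_b, both_no):
    """Cohen's kappa for two binary classifications of the same students.

    both_yes: identified by method A and method B
    only_a:   identified by method A only
    only_b:   identified by method B only
    both_no:  identified by neither method
    """
    n = both_yes + only_a + only_b + both_no
    observed = (both_yes + both_no) / n          # raw agreement
    a_yes = (both_yes + only_a) / n              # base rate, method A
    b_yes = (both_yes + only_b) / n              # base rate, method B
    expected = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)  # chance agreement
    return (observed - expected) / (1 - expected)


# Hypothetical counts: the methods agree on 90 of 100 students.
print(round(cohens_kappa(both_yes=10, only_a=5, only_b=5, both_no=80), 2))  # 0.61
```

In this hypothetical case, raw agreement is 90%, but kappa is only about .61, because most of the agreement comes from the large group that neither method identifies.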

Study 2

  • In Study 2, the authors demonstrated that the classification agreement between processes using distinct measures varied systematically according to the correlation of the two measures. Kappa, for example, was .46 when the correlation between measures was .80, but it rose to .70 when the correlation between measures was .95. However, neither value reached the criterion specified by Macmann et al. (kappa > .90). This finding demonstrates that tests must be very highly correlated to achieve good agreement in classification decisions.
  • Agreement on classification decisions was also influenced by the cutoff scores used—in this case, the cutoff score to indicate a severe discrepancy. Through a series of scatter graphs, the authors illustrated how more extreme cutoff scores resulted in less dependable classifications.
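
Both patterns can be illustrated with a small simulation. The sketch below is a simplified illustration rather than a reproduction of the authors' analysis, so it is not expected to match their kappa values: it assumes two bivariate-normal discrepancy scores on a z-score scale, flags a student when a score meets or exceeds the cutoff, and uses hypothetical names, cutoffs, and seed.

```python
import math
import random


def simulate_kappa(r, cutoff_z, n=5000, seed=1):
    """Kappa between classifications based on two scores correlated at r.

    A simulated student is flagged by a method when that method's
    z-score meets or exceeds cutoff_z.
    """
    rng = random.Random(seed)
    both = only_a = only_b = neither = 0
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # Build a second score that correlates r with the first.
        z2 = r * z1 + math.sqrt(1.0 - r * r) * rng.gauss(0.0, 1.0)
        a, b = z1 >= cutoff_z, z2 >= cutoff_z
        if a and b:
            both += 1
        elif a:
            only_a += 1
        elif b:
            only_b += 1
        else:
            neither += 1
    observed = (both + neither) / n
    a_yes, b_yes = (both + only_a) / n, (both + only_b) / n
    expected = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)
    if expected == 1.0:  # no variation in decisions; kappa is undefined
        return float("nan")
    return (observed - expected) / (1 - expected)


for r in (0.80, 0.95):
    for cutoff in (1.0, 2.0):
        print(f"r={r:.2f} cutoff={cutoff:.1f} kappa={simulate_kappa(r, cutoff):.2f}")
```

Under these assumptions, kappa rises as the correlation between the two scores increases and falls as the cutoff becomes more extreme, which mirrors the direction of the findings described above.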


Implications

  • This study clearly illustrated the limitations of actuarial classifications based on test scores. Among both practitioners and researchers, there is an occasional tendency to conflate a test score with the skill it ostensibly measures. However, it is important to remember that any educational test is an imperfect measure of an unobservable skill (e.g., reading proficiency). It does not measure the skill perfectly, and it is not perfectly reliable. Thus, all classification decisions based on these data will be subject to error.
  • This principle extends beyond methods to calculate a significant achievement-ability discrepancy for the identification of LD. Indeed, all methods that rely on strict cutoffs for classification decisions will demonstrate similar classification errors. This finding has been demonstrated for other, more recently proposed methods to identify LD, including low achievement criteria (Francis et al., 2005), models that identify inadequate response to intervention (Barth et al., 2008; Fletcher et al., in press), or models that identify a pattern of cognitive strengths and weaknesses (Miciak, Fletcher, Stuebing, Vaughn, & Tolar, in press; Stuebing, Fletcher, Branum-Martin, & Francis, 2012).
  • Given the imperfect nature of classifications based on psychometric methods, Macmann et al. call for a reorientation in our understanding of psychoeducational assessment, endorsing assessment based on a “coherent psychology of helping” (p. 145). Rather than focusing on identifying the right children, assessment can be understood as a means to evaluate and guide instruction. Nearly 25 years later, this call continues to ring true.


References

Barth, A. E., Stuebing, K. K., Anthony, J. A., Denton, C., Fletcher, J. M., & Francis, D. J. (2008). Agreement among response to intervention criteria for identifying responder status. Learning and Individual Differences, 18, 296–307.

Fletcher, J. M., Stuebing, K. K., Barth, A. E., Miciak, J., Francis, D. J., & Denton, C. A. (in press). Agreement and coverage of indicators of response to intervention: A multi-method comparison and simulation. Topics in Language and Learning Disorders.

Francis, D. J., Fletcher, J. M., Stuebing, K. K., Lyon, G. R., Shaywitz, B. A., & Shaywitz, S. E. (2005). Psychometric approaches to the identification of LD: IQ and achievement scores are not sufficient. Journal of Learning Disabilities, 38, 96–108.

Miciak, J., Fletcher, J. M., Stuebing, K. K., Vaughn, S., & Tolar, T. D. (in press). Patterns of cognitive strengths and weaknesses: Identification rates, agreement, and validity for learning disabilities identification. School Psychology Quarterly.

Morris, R. D., & Fletcher, J. M. (1988). Classification in neuropsychology: A theoretical framework and research paradigm. Journal of Clinical and Experimental Neuropsychology, 10, 640–658.

Stuebing, K. K., Fletcher, J. M., Branum-Martin, L., & Francis, D. J. (2012). Evaluation of the technical adequacy of three methods for identifying specific learning disabilities based on cognitive discrepancies. School Psychology Review, 41, 3–22.