An earlier version of this essay appears in the Summer 2008 issue of educational Horizons.
Classification Error in Evaluation
the impact of the "false positive" on educational practice and policy
©2008 Edward G. Rozycki
...we have not developed a formal way of reasoning probabilistically about this type of problem, ... clinical judgment may be faulty and ... current clinical practices may be inconsistent or incorrect. -- David M. Eddy 
Link to Part II: The Capacity to Benefit from Formal Academic Schooling: two ideologies of distribution.
Part I: Introduction
A casual approach to the notion of test accuracy tends to conflate some very important distinctions: there is a difference between what practitioners call test sensitivity (or specificity) and what they call the positive predictive value of a test. The point of this essay is that the difference between the terms is very important -- especially at the level of educational policy -- and not merely a technician's fetish.
Test sensitivity expresses the percentage of persons who test positive for a characteristic out of all who truly have that characteristic. For example, if we know that 10% of a population have AIDS, then a 100% sensitive test would identify the infected 50 persons out of a population of 500 as positives, having AIDS.
Test specificity expresses the percentage of persons who test negative for a characteristic out of all who truly lack that characteristic. If specificity is 100%, then it would identify the non-infected 450 of the population given above as negatives, i.e. not having AIDS.
But no test is 100% sensitive (or specific). Consequently, it will generate a number of "false positives," persons who lack the characteristic looked for, but nonetheless test positive and who, for practical purposes, initially, at least, will be indistinguishable from true positives. (There will also be "false negatives" who may suffer from misclassification.)
The positive predictive value (PPV) of a test is a measure of the probability that a person who tests positive for a characteristic actually has that characteristic.  In other words, can we trust the test? Given, for example, that a student has been test-identified as a drug user, how likely is it that that student is truly a user? Or, if a placement test indicates that a student is ready for instruction at a third grade level in mathematics, is that student truly ready to begin to progress from that starting point?
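Stated compactly, and writing TP, FP, TN and FN for the counts of true positives, false positives, true negatives and false negatives (abbreviations adopted here for convenience), the three measures are:

```latex
\[
\text{sensitivity} = \frac{TP}{TP + FN}, \qquad
\text{specificity} = \frac{TN}{TN + FP}, \qquad
\text{PPV} = \frac{TP}{TP + FP}.
\]
```

Sensitivity and specificity are each computed within one "true" group. PPV mixes counts from both groups, which is why it depends on their relative sizes.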
A Decision Simulation
Over several years of university classes I have asked my students, mostly principals and superintendents in a doctoral program, whether they would consider using random drug testing in their schools under the following conditions:
1. the Drug Test (for brevity's sake, let us call it DT) would correctly identify 19 out of 20 "true" drug abusers as positives; DT would correctly identify 19 out of 20 "true" non-abusers as negatives;[2b]
2. Suppose that there are, at most, five percent of a total of 10,000 K-12 students abusing controlled substances;
3. the costs of the first use of DT for each student would be borne by the manufacturer; only reapplications of DT, if there were any, would have to be paid for at the rate of fifty dollars each.
4. To be minimally intrusive on the educational program, but maintain a hoped-for deterrent presence, only one or, at most, a few students would be selected each day by lottery to be tested by DT.
Over the years the considerations these experienced educators would entertain and the conclusions they would reach were pretty much the same: the great majority would say yes to implementing a program of random testing, despite much they found objectionable about it.
After this first stage of reaching a decision, I would caution them that DT would generate what are called "false positives," students mistakenly identified by DT as abusers, for whom such a determination might have severe social consequences. With reluctance the educators would insist they would implement the use of DT, but not push for criminal charges against these young people. Rather they would provide counseling and therapeutic support to the test-identified-users.
Politically sensitive, the principals and superintendents thought that drug abuse by the postulated 500 students would be perceived by important members of their communities to be so alarming that it would be worth the risk of identifying some students falsely as positive. As administrators used to tight budgets, they found the no-cost offer particularly enticing: a fifty-dollar per student saving.
I pressed them on what they would do to handle possible false positives. After some discussion the general consensus would be this: have students identified as users retake the test. The test is 95% accurate, they argued, so if a student tested twice as a user, the probability was (.95x.95 = .9025) that the student really was a user. (There shouldn't be, they reasoned, very many retests needed -- maybe five percent, the incidence rate.) Concerns for nurture and humaneness would erupt in their repeated emphasis that their investigation with DT was not intended to be punitive.
Disregard for False Positives
The critical question for the practitioner is this: given that a student has been DT-identified as a drug user, what is the likelihood that that student is a "true" user? This is a question that invokes PPV. Thinking that the only consideration was the sensitivity of DT in identifying users, my educator-students were wildly wrong in their estimations. And they were very surprised by the correct answer, which completely reversed their previous conclusions.
Educators are not unique in their disregard of the effects of classification error. Decades ago, David M. Eddy found that in trying to evaluate a patient's symptoms, physicians believed that the prevalence of a disease in a population need not be used in estimating the probability that a particular patient has the disease.  Gerd Gigerenzer, more recently, reviews a sequence of examples to show how even in the law, critical mistakes in reasoning about classification have been far from uncommon.
I explained to my educators that they were overlooking a very important thing, prevalence: the relative size of the user group to the entire population under consideration. Test sensitivity by itself was not sufficient to determine the probability that test-identified users were real users. Prevalence of abuse has a major influence.
Let's begin by reconsidering how DT sorts the students.
a. We have 10,000 K-12 students of whom, 5%, or 500, are drug abusers. There are 9500 non-abusers.
b. DT will correctly identify 19 out of 20 "true" abusers, that is, 95 out of 100 "true" abusers correctly as abusers. This group is traditionally called the "true positives" because they are "truly" abusers and will test "positive" on DT.
c. The other 5 "true" abusers are misclassified as non-users. They are "false negatives" because they will test "negative" on the DT, which is a "false" characterization of their "true" abuser status.
d. DT will also correctly identify 19 out of 20 "true" non-abusers, that is, 95 out of 100 "true" non-abusers correctly as non-abusers. This group is traditionally called the "true negatives" because they are "truly" non-abusers and will test "negative," non-user, on DT.
e. The other 5 "true" non-abusers are misclassified as users. They are "false positives" because they will test "positive" on the DT, which is a "false" characterization of their "true" non-abuser status.
But the number of non-users is vastly greater than the number of users. This is why prevalence has an effect on the probability of correctly identifying a "true" user. As far as DT is concerned, "true" users are indistinguishable from and will be confounded with false positives.
Let's do the arithmetic.
The Likelihood of Error
DT will split each ideal group, abuser and non-user, in two in the proportion of 95 to 5. See the following chart:
|Of "True" Users: 500||Of "True" Non-users: 9500|
|475 true positives||475 false positives|
|25 false negatives||9025 true negatives|
Our problem now is to decide whether a student identified as an abuser by DT is a "true" abuser, or a non-user misidentified by DT as an abuser. Members of the two positive groups are indistinguishable. The probability that we have a "true" positive, given that DT has identified a student as positive, is the number of true positives divided by the number of all Test Positives, that is, 475/(475+475). This equals 1/2.
In other words, we have a fifty-fifty chance of misidentification. My students found this objectionably high. Besides, it would require us to retest each student at the higher cost. That clinched it: they invariably reversed their decision to implement random testing for the specified situation.
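The sorting described above can be checked with a few lines of Python. This is a minimal sketch, not part of the original classroom exercise: the variable names are illustrative, and the figures (10,000 students, 5% prevalence, 19 correct calls out of 20 in each group) are the ones postulated in the simulation.

```python
# Postulated conditions of the simulation (from the DT example).
students = 10_000
users = students * 5 // 100          # 5% prevalence -> 500 "true" users
non_users = students - users         # 9,500 "true" non-users

# DT calls 19 out of 20 correctly in each group.
true_positives = users * 19 // 20             # users flagged as users
false_negatives = users - true_positives      # users missed by DT
true_negatives = non_users * 19 // 20         # non-users cleared by DT
false_positives = non_users - true_negatives  # non-users flagged as users

# PPV: of all students DT flags, what share are "true" users?
test_positives = true_positives + false_positives
ppv = true_positives / test_positives
share_flagged = test_positives / students

print(f"test positives: {test_positives}")
print(f"PPV: {ppv:.2f}")
print(f"share of students flagged: {share_flagged:.1%}")
```

Under these assumptions 950 students, or 9.5% of those tested, would be flagged by DT. That is nearly double the five percent the educators expected to have to retest.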
Immediately below is a chart combining different possibilities of sensitivity and prevalence. It has been constructed with the restrictions given in the introductory problem, disregarding externalities. (See the next Section: Policy Issues.) For pedagogical simplicity, sensitivity and specificity are assumed to be equal. (This assumption will be re-examined in Part II, "The Capacity to Benefit From Formal Academic Schooling.")
Although for dramatic effect in the simulation described above I chose values which gave a PPV outcome of .50, one can see on the chart that wherever sensitivity + prevalence = 1, e.g. sensitivity = .30 and prevalence = .70, PPV = .50.  Note well that even with very sensitive tests, low prevalence yields practically prohibitively low PPV, i.e. 50% or less.
The coordinates outside the marginal entries are coordinated with three dimensional graphs of figure 1 found in the reference section. Deciles are color-banded. The width of band along the diagonal is inversely related to rate of change in PPV within that band.
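A chart of this kind can be regenerated with a short Python sketch. It is an illustration only, not the spreadsheet used for the original chart: the function name `ppv_equal` and the five marginal values are chosen here for the example, and specificity is assumed equal to sensitivity, as in the introductory problem. Exact fractions are used so that the "one half" cases come out exactly.

```python
from fractions import Fraction

def ppv_equal(s, p):
    """PPV when specificity is assumed equal to sensitivity s, at prevalence p."""
    return s * p / (s * p + (1 - s) * (1 - p))

# Illustrative marginal values; zero and one are left out to avoid division by zero.
levels = [Fraction(n, 100) for n in (5, 30, 50, 70, 95)]

# Wherever sensitivity + prevalence = 1, the PPV comes out at exactly one half.
for s in levels:
    assert ppv_equal(s, 1 - s) == Fraction(1, 2)

# Print a small grid: rows are sensitivity, columns are prevalence.
print("s \\ p " + " ".join(f"{float(p):5.2f}" for p in levels))
for s in levels:
    row = " ".join(f"{float(ppv_equal(s, p)):5.2f}" for p in levels)
    print(f"{float(s):5.2f} {row}")
```

Reading along the top row of sensitivity values against the leftmost prevalence column shows how quickly PPV falls away as prevalence drops, even when the test itself is very sensitive.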
Organizational Externalities: provenance, not just science, is often the issue.
Some years back I conducted a graduate course that dealt with using technology to implement policy. Many of my students who worked in research commented that they were "under pressure" often to "accentuate the positive and eliminate the negative" in the reports they made about their laboratory results. This was, for those whose organizations were sales-dependent or government grant-dependent, a matter of promotional strategy.
One student, who worked as a chemist in a company that tested blood samples for evidence of drug abuse, objected to my presenting the material above as plausibly correct. He asserted that he could guarantee the testing in his labs to an accuracy of .0001%.
I replied that I was willing to accept his laboratory outcomes to be of that precision. However, the issue was not about the blood or saliva samples he worked with, but about the provenance of the samples, their "chain of evidence," so to speak. He was most likely in no position to guarantee that a sample he worked on had actually come from the person to whom it was supposedly linked. Those links were in many cases not difficult to get around, or to substitute for, so as to disconnect a substance-abuser from a contaminated sample.
The protesting student was clearly unhappy with this reply. He got up and left. That was the last I saw of him.
For more on externalities, see Part II: The Capacity to Benefit from Formal Academic Schooling: two ideologies of distribution 
Identifying "True" Drug Users. Doesn't This Presume Anterior Testing?
Though they did not recognize its importance at the time, some of my more philosophically inclined students would press me as to what the terms "true users" or "true non-users" could mean if DT was needed to identify users. What was supposed to be the difference between a "true" user and a DT-identified user? Or between a "true" non-user and a DT-identified non-user?
I would tell them that the prevalence, which compares "ideal" numbers of "true" users to "true" non-users, is often gotten from different aspects of "reality." In many fields, practitioners invoke a "gold standard" or "priors," which are widely treated as unchallengeable fact, in order to make the comparison.
Lacking a gold standard we might get "true" numbers by postulation -- as in this exercise -- or by estimation from smaller samples of students, using procedures too costly or time-consuming to be applied to the whole population. On occasion -- more frequent in education than generally acknowledged -- an incidence rate will be conjured up, even, by guess, intuition or reference to tradition.
This is the second prong of our argument: policy matters can be misconceived not only by ignoring the effects of false positives on the estimation of sensitivity but also by the manner in which a "gold standard" is chosen. Let us consider some more examples.
Triage is the practice of trying to get the "most bang for the buck." It is commonly associated with hospitals or battlefields in situations where resources are too scarce for optimal allocation. But any organization facing scarcities -- public schools, typically -- finds its decision-making more and more involved with the practice of triage.
The population to be "treated" is divided into three groups:
1. persons upon whom allocations would be wasted since they are not necessary, that is, superfluous, for them to meet certain goals, e.g. recovery from sickness or wounds;
2. persons upon whom allocations would be wasted since they are not sufficient -- even, perhaps in their entirety -- for them to meet certain goals;
3. persons who would use the allocations most effectively to meet certain goals.
Where medical supplies are scarce, practitioners will withhold them from those who don't need them since they are medically not in danger and, also, from those too far gone to be helped by them. Only those who can be reasonably expected to be helped will get them.
This pattern is common in other areas also. In education one common example of triage is called "teaching to the middle."  A teacher in an overcrowded classroom will spend less time with "bright" kids who don't need his assistance to understand the lesson, or with kids who could use up all his time and still not "get it." He will reasonably direct his attention to the group who he believes will show most benefit from his help.
A neighborhood football club low in cash will spend its equipment money neither on those who can afford their own equipment -- save, perhaps, for special team jerseys; nor, on those least likely to play regularly. Money goes to equip those needy members most likely to be regular players.
For our purposes we can treat the triage as composing only two groups: those "in-need" who are to receive treatment and those "no-need" who don't. We will have to devise a test, a means for sorting the groups out. But what is the "gold standard" prevalence? Where do the numbers come from? Where triage is concerned, very often perceived resource restrictions or political pressures dictate the numbers.
In the medical case there is an understandable bias toward supporting the professional judgment of practitioners. The very employment of triage depends upon their judgment that resources are scarce. Secondly, the sorting out into in-need versus no-need groups exercises that judgment. Is it likely that such decisions will be reviewed later to determine whether triage was needed after all, or whether group placements were truly accurate? Who could tell?
Public education offers far less sophisticated examples. For example, during the 1970's the School District of Philadelphia was forced to deal with its "special" populations, which, up to that time were being ignored. An initial testing determined that 70% of the students fell into the category, "in-need" of special services. The projected costs were far in excess of any conceivable, realistic budget. Examiners were told to "cap" the in-need group at 30%. [11b]
Let us suppose that the initial testing was based on the reality of what it would take to succeed in the classrooms the students would actually face. This is the gold standard. However, by administrative fiat, the difference between the 70% found in need and the 30% cap, that is, 40% of all students, was mixed back into the no-need group. School district researchers, wanting to remain employed, developed "measurement" devices that supported that fiat. (A director of research actually resigned rather than fiddle data.) One can easily conclude that the sensitivity of those "tests" -- not so much paper-and-pencil affairs as sorting procedures and policies -- had to be mind-bogglingly promiscuous.
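The arithmetic behind that conclusion can be sketched as follows. This is an illustration under the supposition just made, that the initial 70% figure is the gold standard; it further assumes, most favorably to the district, that every student placed within the 30% cap was truly in need. The variable names are hypothetical.

```python
from fractions import Fraction

in_need_true = Fraction(70, 100)   # gold-standard share of students in need
cap = Fraction(30, 100)            # share the examiners were allowed to identify

# Best case (an assumption): everyone under the cap is truly in need.
best_sensitivity = cap / in_need_true      # share of in-need students identified
reclassified = in_need_true - cap          # in-need students counted as no-need
no_need_group = 1 - cap                    # size of the resulting no-need group
in_need_within_no_need = reclassified / no_need_group

print(f"sensitivity at best: {float(best_sensitivity):.0%}")
print(f"in-need share of the no-need group: {float(in_need_within_no_need):.0%}")
```

On these assumptions the sorting procedure could identify at most three of every seven in-need students, and four of every seven students in the resulting no-need group would in fact be in need.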
Not surprisingly, the district adopted policies of social promotion and put pressure on high school teachers to give minimal grade point averages of 60 -- even if students did not attend classes -- so that the passing grade of 70 would not be hard to achieve, even if most of the semester's work was undone. 
Progressing upward from grade to grade in this manner generates a flood of curricularly inept "false positives" which now fills our colleges -- as suggested by the epidemic dimensions of cheating and plagiarism at all levels of educational endeavor.
Assessing "Yearly Adequate Progress" via educational "growth"
The "false positive effect" impacts many other areas of educational practice.
Since high-achieving schools can meet any reasonable criterion of success with little effort, while those very low-achieving schools can raise test scores tremendously but still miss the criterion, many educational authorities have suggested that "value added" or "growth" criteria be the basis of school evaluation.
However, grade-level placement, especially in public schools, is a haphazard process. Placement tests tend to be given seldom and inconsistently -- to avoid the cost. Tradition has it that children at age 6 are ready for 1st grade curriculum; at age 7, they will be ready for 2nd grade curriculum. (Those who invoke such tradition rarely know that until the middle of the 18th Century, Yale did not require knowledge of arithmetic for admission.)
There is little consensus on any educational "gold standard" that does not derive either from tradition or from the special interests of teachers and teacher educators to maintain a market for their services. In addition, parent demands can influence placement substantially.[14b] Thus, in effect, we have in practice in many schools a "placement test," consisting of a conglomerate of procedures based mainly on hope, tradition, special interest and organizational expedience.
If we suppose, for the sake of hypothesis -- our theoretical "gold standard" -- that each grade level in a particular school requires certain preparatory skills, dispositions and knowledge for success in the coming year, we can be quite certain that our haphazard admission processes will allow in quite a few "false positives;" that is, students who at the point of admission appear no less capable -- they walk, they talk, they can fog a mirror -- than the students who possess the skills, dispositions and knowledge to succeed at that level. The flood of incompetence mentioned earlier commences.
Consequently, assessing anything like "yearly adequate progress" via year-to-year average "growth" is not only unfair to school staff but likely of minimal validity in measuring academic progress.
 David M. Eddy, "Probabilistic reasoning in clinical medicine: Problems and opportunities" Chapter 18 pp. 249 - 267 in Daniel Kahneman, Paul Slovic & Amos Tversky (eds.) Judgment under uncertainty: Heuristics and biases. Cambridge University Press 1982. p. 267.
 For pedagogical simplicity, I posed the problem to my classes on the assumption that test sensitivity would be the same as its specificity. (Realistically, they should each be treated as an independent random variable, a concept unfortunately too daunting for even advanced students in many American universities.) Depending on the sources across different disciplines as well as professions, one finds some variation in the definition of key terms. In this essay I will use the definitions for test-related terms given online by Family Practice Notebook.com at http://www.fpnotebook.com. These definitions, as well as alternatives, are given in the chart below. The most important ones for the arguments in this essay are these:
a. test sensitivity: the percentage of true positives that are identified as test-positive; (or, test specificity, the percentage of true negatives that are identified as test-negative.)
b. positive predictive value: the ratio of true positives to the sum of true positives plus false positives; PPV = tp /(tp+fp). Compare
|Source||URL||Formula||Term||Definition|
|FamPracNote||http://www.fpnotebook.com/Prevent/Epi/NgtvPrdctvVl.htm||=true neg/(true negs+false negs)||Neg Pred Val||ratio of true negatives to sum of true and false negatives|
|FamPracNote||http://www.fpnotebook.com/Prevent/Epi/PstvPrdctvVl.htm||=(true pos)/(true +false pos)||Pos Pred Val||ratio of true positives to sum of true and false positives|
|Wiki||http://en.wikipedia.org/wiki/Sensitivity_%28tests%29||PPV=tp/(tp+fp)||Pos Pred Val||.|
|Wiki||http://en.wikipedia.org/wiki/Test_sensitivity||=tp/(tp+fn)||sensitivity||the probability that if the person has the disease, the test will be positive.|
|Johnson||http://www.unr.edu/homepage/jerryj/NNN/MedicalTesting.pdf||.||sensitivity||ratio of true positive to all test positives|
|FamPracNote||http://www.fpnotebook.com/PRE18.htm||.||sensitivity||true positives per total affected (positives)|
|Childrens||http://www.childrensmercy.org/stats/definitions/sensitivity.htm||Sn = TP / (TP + FN)||sensitivity||probability that the test is positive when given to a group of patients with the disease.|
|FamPracNote||http://www.fpnotebook.com/Prevent/Epi/TstSpcfct.htm||.||specificity||true negatives per total unaffected|
|FamPracNote||http://www.fpnotebook.com/Prevent/Epi/TstSpcfcty.htm||tn/(tn+fp)||specificity||= true neg/ unaffected tested|
|Johnson||http://www.unr.edu/homepage/jerryj/NNN/MedicalTesting.pdf||.||specificity||ratio of true negatives to all test negatives|
|.||.||Sp = TN / (TN + FP)||specificity||the probability that the test will be negative among patients who do not have the disease.|
There is a deeper issue. If sensitivity and specificity were different, then the proportions of true to false items within the same population would not -- within a reasonable range of error -- equal 100%. We might try to substitute a test where, say, the sensitivity of the original test was matched by the specificity of the second test. But this might raise issues of construct validity. Since the specificities of the two tests are different, are the trues and not-trues of the first test the same kinds of items as the trues and not-trues of the second test?
If the two tests serve different vociferous interests, this will make for political or organizational headaches; for example, between those who worry that false positives of drug-testing will suffer from prosecutorial overreach and those who worry that drug-using or purveying felons will escape the wheels of justice.
[2b] This sensitivity is not artificially low. It yields a PPV similar to examples found in Gerd Gigerenzer, "Ecological Intelligence" (Chapter 4 in Gerd Gigerenzer Adaptive Thinking. Rationality in the Real World (2000) Oxford University Press). Gigerenzer gives the following test characteristics:
a. low-risk group AIDS testing, sensitivity, s = 50%, Gigerenzer (2002) p.125; or
b. mammogram, 50% sensitivity in Gerd Gigerenzer, Calculated Risks (2002) New York: Simon & Schuster, footnote 20, p 265.
An understandable reaction. A member of one of the classes that participated in the simulation wrote a paper expressing his perspective on the issue: see William J. McIlmoyle "Random Drug Tests for High School
The problem is that multiplying the probability of each instance to get the probability of the combined outcome assumes independent events. The two occasions of testing are not independent.
 Eddy (1982). See especially pp. 258 - 259.
 See Gerd Gigerenzer "Ecological Intelligence" Chapter 4 in Gerd Gigerenzer Adaptive Thinking. Rationality in the Real World (2000) Oxford University Press pp. 59 - 76. For an interesting exposition on statistical folderol in the drug industry, see R. Brian Attig & Allison Clabaugh, "Clinical Trials and Statistical Tribulations" in Applied Clinical Trials (February 2008) pp. 42 - 46 accessible at http://actmagazine.findpharma.com/appliedclinicaltrials
I could not use zero in the marginals since I constructed the chart with Microsoft Excel and had to be concerned with division by zero. The formula for creating the chart from the marginal values is
a. PPV' = s*p/(s*p+((1-s)*(1-p)))
where PPV' = positive predictive value of the test under the special condition of the initial problem example, namely that the test's sensitivity is assumed equal to its specificity. This special formula contrasts with the standard formula
b. PPV = s*p/(s*p+((1-sp)*(1-p)))
where s = sensitivity, p = prevalence and sp = specificity. Note that if sp = s, the formulae are identical.
(See "Calculation of Positive Predictive Value" available at http://www.mas.ncl.ac.uk/~njnsm/medfac/MBBS/handout.pdf)
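Formula b. can be recovered by counting, as the following short sketch suggests. Take a population of N persons (N is introduced here only for the derivation and cancels out), with s, sp and p as defined above.

```latex
\begin{align*}
\text{true positives}  &= s \, p \, N, \\
\text{false positives} &= (1 - \mathit{sp})\,(1 - p)\, N, \\
\mathrm{PPV} &= \frac{s\,p\,N}{s\,p\,N + (1 - \mathit{sp})(1 - p)\,N}
             = \frac{s\,p}{s\,p + (1 - \mathit{sp})(1 - p)} .
\end{align*}
```

Setting sp = s gives formula a. With the simulation's values, s = .95 and p = .05, both terms of the denominator equal .0475, so PPV' = .50.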
I am indebted to Professor Fredrik Nyberg for alerting me to my having failed to explain my deviant formulation and for supplying me with a contrasting one, formula b., to consider.
 See Lucas M. Bachman et al "Consequences of different diagnostic 'gold standards' in test accuracy research: Carpal Tunnel Syndrome as an example" in International Journal of Epidemiology 2005; 34:953-955 available online as pdf at http://ije.oxfordjournals.org/cgi/content/abstract/dyi105v1
See Eliezer Yudkowsky, "An Intuitive Explanation of Bayesian Reasoning" at http://yudkowsky.net/rational/bayes/ for an entertaining and insightful introduction to the topic, especially the "fun facts" on priors: "Q: Where do priors originally come from? A: Never ask that
Also, for a very well constructed lesson on this Bayesian thinking in testing, see Jerry Johnson, Medical Testing at http://www.math.dartmouth.edu/~mqed/UNR/MedicalTesting/MedicalTesting.phtml
See, also, "How to Improve Bayesian Reasoning without Instruction" Chapter 6 in Gigerenzer (2000) pp. 92 - 123; or "Insight" Chapter 4 of Gigerenzer Calculated Risks (2002) pp. 39 - 51. For a quick overview of Gigerenzer's and Selten's heuristics types and comparison with standard computational methods see the "Types of Heuristics" chart available at
 The "reality" that gives us "true" measures is a very general concept deriving from some or all of the following:
a. the correlation among different kinds of tests applied to a given situation;
b. the convergence of (non-prejudiced) tests toward common limits;
c. our personal sense of continuity and correlation within our own experience;
d. the degree of consensus that exists within and across the different communities we participate in.
Each of these has its own fallibilities. See Edward G. Rozycki, "Philosophical Foundations of Human Cognition" at http://www.newfoundations.com/CogTheo/CogTheo1.html
 See Edward G. Rozycki, "The Ethics of Educational Triage" at http://www.newfoundations.com/EGR/Triage.html
[11b] For a saga of manipulating test populations in order to exaggerate test gains see E. G. Rozycki "Contracting a Real Performance" at http://www.newfoundations.com/EdBiz/PERFORMANCE.html
 To this day, officials of the school district and the organizations that parasitize it insist that the basic reason that school district students do not achieve well academically is mainly, if not solely, because teachers are not held accountable for student performance.
I was on a research team in 1979 for the School District of Philadelphia looking to find a placement test for ESOL students. After months of intensive review of many reasonable choices, our recommendation was overridden by the imposition of a quite inferior test by the School District Legal Department based on the fact that the use of that test was successfully defended in a lawsuit taken against the New York City schools. -- EGR
 See Gary K. Clabaugh & Edward G. Rozycki "Cheating Trends" available at http://www.newfoundations.com/PREVPLAGWEB/CheatingTrends1.html
 See Ted Hershberg's opinion piece, "Follow growth, not achievement" Philadelphia Inquirer March 3, 2008. Stanford University has released an important report on the negative effects of tracking, "School Tracking Harms Millions, Sociologist Finds," available at http://news-service.stanford.edu/pr/94/940302Arc4396.html
[14b] The political influences on curriculum and testing tend to be grossly underestimated, even (one might say, "especially") by professionals in the field who tend to disregard perspectives and interests other than their own as "lacking objectivity." For an overview of this issue, see Gary K. Clabaugh & Edward G. Rozycki, "The Foundations of Curriculum" available at http://www.newfoundations.com/FdnsCurriculum.html
An important book dealing with the effects of externalities on research standards is Daston, L & Galison, P (2010) OBJECTIVITY New York. Zone Books.