PSYCHOLOGY PAPER

There is a movement in psychology to require the use of psychological measures to aid in the diagnosis of some mental illnesses. While the DSM-V was being developed, there was some discussion of including the results of specific measures as part of the diagnostic criteria. Considering the two articles you read, "Limitations of Diagnostic Precision and Predictive Utility in the Individual Case: A Challenge for Forensic Practice" and "Diagnostic Utility of the NAB List Learning Test in Alzheimer’s Disease and Amnestic Mild Cognitive Impairment," write a response paper discussing the utility of using psychological measures, as suggested above, in the diagnosis of mental illness.

ORIGINAL ARTICLE

Limitations of Diagnostic Precision and Predictive Utility in the Individual Case: A Challenge for Forensic Practice

David J. Cooke · Christine Michie

Received: 24 August 2007 / Accepted: 2 February 2009 / Published online: 11 March 2009
American Psychology-Law Society/Division 41 of the American Psychological Association 2009

Abstract Knowledge of group tendencies may not assist accurate predictions in the individual case. This has importance for forensic decision making and for the assessment tools routinely applied in forensic evaluations. In this article, we applied Monte Carlo methods to examine diagnostic agreement with different levels of inter-rater agreement given the distributional characteristics of PCL-R scores. Diagnostic agreement and score agreement were substantially less than expected. In addition, we examined the confidence intervals associated with individual predictions of violent recidivism. On the basis of empirical findings, statistical theory, and logic, we conclude that predictions of future offending cannot be achieved in the individual case with any degree of confidence. We discuss the problems identified in relation to the PCL-R in terms of the broader relevance to all instruments used in forensic decision making.
There is an important disjunction between the perspective of science and the perspective of the law; while science seeks universal principles that apply across cases, the law seeks to apply universal principles to the individual case. Bridging these perspectives is a major challenge for psychology (Faigman, 2007). It is recognized by statisticians that knowledge of group tendencies, even when precise, may not assist accurate evaluation of the individual case (e.g., Colditz, 2001; Henderson & Keiding, 2005; Rockhill, 2001; Tam & Lopman, 2003). It is a statistical truism that the mean of a distribution tells us about everyone, yet no one. This has serious implications for the use of psychological tests in forensic decision making. To illustrate these limitations, we focus on one of the most widely used, and perhaps the most extensively validated, tests in the forensic arena: the Psychopathy Checklist Revised (PCL-R;¹ Hare, 2003). We emphasize, however, that all psychological tests used in the same way in the forensic arena will suffer from similar limitations (e.g., VRAG, Quinsey, Harris, Rice, & Cormier, 1998; Static-99, Hanson & Thornton, 1999; COVR, Monahan et al., 2005).

Mental health professionals are frequently asked to opine whether an individual might be violent in the future; psychopathic personality disorder is an important risk factor to consider (Hart, 1998). The PCL-R is the most frequently used measure of psychopathic personality disorder; it has been described as the "gold standard" for that purpose (Edens, Skeem, Cruise, & Cauffman, 2001; as cited in Hare, 2003). There can be little doubt that the PCL-R has made a major contribution to our understanding of violence (Hart, 1998); nonetheless, it is important for the field to consider both its strengths and its limitations. Findings for this instrument will have implications for less well-validated tools.
In this introduction, we consider two issues: first, the use of PCL-R scores in forensic practice and second, the general problem of the precision of predictions about an individual case.

D. J. Cooke (✉) · C. Michie, Department of Psychology, Glasgow Caledonian University, Glasgow G4 0BA, UK; e-mail: djcooke@rgardens.vianw.co.uk

1 The PCL-R is a 20-item rating scale of traits and behaviors intended for use in a range of forensic settings. Definitions of each item are provided and evaluators rate the lifetime presence of each item on a 3-point scale (0 = absent, 1 = possibly or partially present, and 2 = definitely present) on the basis of an interview with the participant and a review of case history information.

Law Hum Behav (2010) 34:259–274, DOI 10.1007/s10979-009-9176-x

PCL-R SCORES AND FORENSIC PRACTICE

Much of the interest in the construct psychopathy comes from the relationship between the PCL-R and future criminal behavior (Lyon & Ogloff, 2000). Previous research suggests that psychopathy, as assessed using the Psychopathy Checklist-Revised (PCL-R; Hare, 1991), is an important risk marker for criminal and violent behavior (Douglas, Vincent, & Edens, 2006; Hart, 1998; Hart & Hare, 1997; Hemphill, Hare, & Wong, 1998; Leistico, Salekin, DeCoster, & Rogers, 2008; Salekin, Rogers, & Sewell, 1996). In fact, the PCL-R has been lauded as an "unparalleled" single predictor of violence (Salekin et al., 1996). Hart (1998) argued that failure to consider psychopathy in a violence risk assessment may constitute professional negligence. This empirical base has resulted in the PCL-R being used, not merely to measure the trait strength of psychopathy in an individual, but also to make predictions about what he or she will do in the future (Hare, 1993).
As we demonstrate formally below, this additional step of prediction means that the potential for imprecision in forensic evidence is greatly increased: It expands the gulf between inferences about groups and inferences about individuals.

The PCL-R has been incorporated into statutory or legal decision making (Hare, 2003). Within England and Wales, a PCL-R score above a cut-off of 25 or 30 can lead to detention in either a Special Hospital or a prison (Maden & Tyrer, 2003); in certain Canadian provinces parole boards explicitly consider PCL-R scores (Hare, 2003), and in Texas psychopathy assessments are mandated by statute for sexual predator evaluation (Edens & Petrila, 2006).² The PCL-R plays a role in criminal sentencing, including decisions regarding indefinite commitment and capital punishment, institutional placement and treatment, conditional release, juvenile transfer, child custody, witness credibility, civil torts, and indeterminate civil commitment (DeMatteo & Edens, 2006; Fitch & Ortega, 2000; Hart, 2001; Hemphill & Hart, 2002; Lyon & Ogloff, 2000; Walsh & Walsh, 2006; Zinger & Forth, 1998). The PCL-R is regarded by many as the best method for operationalizing the construct of psychopathy. For example, Lyon and Ogloff (2000) argued that "…it is critical that the assessment is made using the PCL-R" (p. 166) when evidence about violence risk, based on psychopathy, is provided. Because of its central role in forensic decision making, it is vital to assess its strengths and limitations and, by comparison, the limitations of less well-validated procedures.

PREDICTIONS FOR INDIVIDUALS VERSUS PREDICTIONS FOR GROUPS

Prediction is the raison d’être of many forensic instruments (e.g., VRAG, Quinsey et al., 1998; Static-99, Hanson & Thornton, 1999; COVR, Monahan et al., 2005). While this is not true of the PCL-R, its frequent use in forensic practice is underpinned by the assumption, implicit or explicit, that it can predict future offending (Walsh & Walsh, 2006).
How precise can such predictions be? The precision of any estimate of a parameter (e.g., mean rate of recidivism of a group) can be measured by the width of a confidence interval (CI); a CI gives an estimated range of values, which is likely to include an unknown population parameter. If independent samples are taken repeatedly from the same population, and a CI calculated for each sample, then a certain percentage (confidence level) of the intervals will include the unknown population parameter. Typically, 95% of these intervals should include the unknown population parameter; other intervals may be used (e.g., 68% and 99%). The width of this interval provides a measure of the precision—or certainty— that we can have in the estimate of the population parameter. The width of a CI of a population parameter is linked, in part, to the sample size used to estimate the population parameter (see below for a more technical explanation). The prevailing prediction paradigm has two stages. First, the parameters (mean, slope, and variance) of a regression model linking an independent variable (e.g., PCL-R score) to a dependent variable (e.g., likelihood of reconviction) are estimated. Each of these parameters has uncertainty associated with them, which can be expressed by confidence bands about the regression line. Second, a new case is selected and the PCL-R score is assessed, the model is applied and the likelihood of reconviction is estimated. The best estimate of the likelihood of reconviction for a new case will be identical to the point on the regression line for that PCL-R score. This new estimate has a CI—also known as a prediction interval—that expresses the precision, or certainty, that should be associated with the prediction made about the new case. Often the two steps are conflated, with the unrecognized assumption being made that the prediction interval for the new case is comparable to the CIs for the model. It is not (see below). 
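The difference between the CI for the model and the prediction interval for a new case can be illustrated with a small sketch. It is only an illustration under stated assumptions: it uses ordinary linear regression with a continuous outcome and synthetic data (not the logistic model or the reconviction data used in this article), and it uses the normal value 1.96 in place of the exact t quantile. The function names and all numbers are hypothetical.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residual_ss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(residual_ss / (n - 2))  # residual standard deviation
    return intercept, slope, s, x_bar, sxx, n


def interval_half_widths(x0, s, x_bar, sxx, n, z=1.96):
    """Half-widths of the CI for the fitted line and of the prediction
    interval for one new case at x0 (normal approximation, an assumption)."""
    leverage = 1.0 / n + (x0 - x_bar) ** 2 / sxx
    ci_half = z * s * math.sqrt(leverage)        # uncertainty about the line
    pi_half = z * s * math.sqrt(1.0 + leverage)  # adds individual scatter
    return ci_half, pi_half


# Synthetic data: a hypothetical score on a 0-40 scale and a noisy outcome.
random.seed(1)
xs = [random.uniform(0, 40) for _ in range(500)]
ys = [10 + 0.5 * x + random.gauss(0, 8) for x in xs]

intercept, slope, s, x_bar, sxx, n = fit_line(xs, ys)
ci_half, pi_half = interval_half_widths(30.0, s, x_bar, sxx, n)
print(f"fitted value at x = 30: {intercept + slope * 30.0:.1f}")
print(f"95% CI half-width for the regression line: {ci_half:.2f}")
print(f"95% prediction-interval half-width for a new case: {pi_half:.2f}")
```

In this sketch the CI for the line shrinks as the sample grows, whereas the prediction interval can never be narrower than about 1.96 residual standard deviations, however large the sample.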
2 The PCL-R is the most commonly used instrument for assessing psychopathy in this setting (Mary Alice Conroy, Personal communication, 10 April 2007).

The problem of making predictions for individuals from statistical models is now recognized in other disciplines. In relation to medical risks, Rose (1992) expressed the position clearly: "Unfortunately the ability to estimate the average risk for a group, which may be good, is not matched by any corresponding ability to predict which individuals are going to fall ill soon" (p. 48). In relation to reoffending, Copas and Marshall (1998) made a related point: "…the score is not a prediction about an individual [italics added], but an estimate of what rate of conviction might be expected of a group [italics added] of offenders who match that individual on the set of covariates used by the score" (p. 170) (see also Altman & Royston, 2000; Bradfield, Huntzickler, & Fruehan, 1970; Colditz, 2001; Elmore & Fletcher, 2006; Henderson, Jones, & Stare, 2001; Henderson & Keiding, 2005; Rockhill, 2001; Rockhill, Kawachi, & Colditz, 2000; Tam & Lopman, 2003; Wald, Hackshaw, & Frost, 1999). It is not generally recognized that a risk factor must have a very strong relative risk (i.e., >50) if it is to have utility as a screening instrument at the individual level (Rockhill et al., 2000; see also Kennaway, 1998). However, others set the bar higher:

A risk factor has to be extremely strongly associated with a disease within a population before it can be considered to be a potentially useful screening test. Even a relative odds of 200 between the highest and lowest fifths will yield a detection rate of no more than about 56% for a 5% false positive rate… (Wald et al., 1999, p. 1564).

To put this in perspective, the relative risk for the association between lung cancer and smoking is between 10 and 15 (Rockhill et al., 2000), depending on the definition of exposure.
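The arithmetic behind the quoted relative odds of 200 can be sketched as follows. This is a reconstruction under assumptions that are not stated in this article: the risk factor is taken to be normally distributed with equal spread in affected and unaffected people, and the fifths are defined on the unaffected distribution. The function names are hypothetical.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def top_vs_bottom_fifth_ratio(d):
    """Relative odds of being affected in the top versus the bottom fifth,
    when affected cases sit d standard deviations above unaffected ones."""
    z80 = STD.inv_cdf(0.80)
    share_top = 1.0 - STD.cdf(z80 - d)  # affected cases above the 80th centile
    share_bottom = STD.cdf(-z80 - d)    # affected cases below the 20th centile
    return share_top / share_bottom


def separation_for_ratio(target, low=0.0, high=6.0):
    """Bisection search for the separation d that gives the target ratio."""
    for _ in range(60):
        mid = (low + high) / 2.0
        if top_vs_bottom_fifth_ratio(mid) < target:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


def detection_rate(d, false_positive_rate=0.05):
    """Share of affected cases above the cut-off that flags the stated
    share of unaffected cases."""
    cutoff = STD.inv_cdf(1.0 - false_positive_rate)
    return 1.0 - STD.cdf(cutoff - d)


d = separation_for_ratio(200.0)
print(f"separation: {d:.2f} SD, detection rate: {detection_rate(d):.2f}")
```

Under these assumptions, relative odds of 200 correspond to a separation of a little under two standard deviations, which still leaves a large share of affected cases below a cut-off set at a 5% false positive rate.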
The relative risk for the PCL-R and recidivism is something of the order of 3 for general recidivism and 4 for violent recidivism (Hare, 2003).

Does the application of current forensic tools provide an adequate basis for testimony concerning the individual case? In this article, we attempt to answer this question by considering three issues pertaining to PCL-R data. How confident can clinicians and legal decision makers be, first, in the use of critical diagnostic cut-offs; second, in the numerical value of PCL-R scores; and third, in individual predictions of violent recidivism? We describe two studies. The first study addresses the accuracy of diagnostic decisions and the potential range of discrepancies between two raters. The second study addresses the accuracy of prediction of future violence in the individual case. The results have relevance beyond the PCL-R to the use of other psychometric instruments in forensic practice: The same limitations may apply to many forensic assessment instruments.

STUDY ONE

In the first study, we examined diagnostic accuracy, specifically the allocation of individuals around two critical cut-offs, i.e., around 30 and around 25; the first is the standard PCL-R cut-off for the diagnosis of psychopathy and the second, often adopted in the UK, has proven useful in that context, including in decisions regarding treatment allocation (Hare, 2003). The inter-rater reliability figures presented in the PCL-R manual can be regarded as good (Nunnally & Bernstein, 1994); intraclass correlation coefficients for single ratings (ICC1) are estimated in some research studies as being above .80 (male offenders = .86; male forensic psychiatric patients = .88; Hare, 2003, Table 5.4).³ Edens and Petrila (2006) indicated that these are probably "best case" estimates and "real world" reliabilities may be substantially poorer.⁴ Murrie, Boccaccini, Johnson, and Janke (2008), in one "real world" study, demonstrated poor agreement (ICC1 = .39).
These views and findings echo concerns expressed by Hare (1998) that, while researchers take great pains to ensure reliability in their studies, the level of reliability achieved by individual clinicians remains unknown and, by implication, is likely to be poorer than in published studies. Inter-rater reliability is not the only relevant consideration: Diagnostic precision is also influenced by the underlying distribution of test scores. Diagnostic precision is influenced by the location of the cut-off and the shape of the distribution of scores (both skewness and kurtosis). Estimates of the precision of a test score (e.g., standard errors of measurement, SEM) are weighted toward the mean of the distribution, whereas cut-offs are generally located substantially above the mean. Item Response Theory (IRT) studies demonstrate that the measurement precision of the PCL-R (in terms of measurement information) falls toward the diagnostic cut-off (Cooke, Michie, & Hart, 2006); thus, the SEM estimated on the mean will provide an optimistic estimate of diagnostic precision. The SEM cannot be directly translated into estimates of precision of diagnosis because of the impact of the score distributions. Equally, it is not possible to estimate misclassification rates directly using ICC1 values; therefore, simulation approaches are required. Study one describes a simulation that examines the impact of unreliability on diagnostic accuracy.

3 The estimates of reliability are frequently obtained by re-rating the same interview or with an observer simultaneously rating within an interview. This will tend to inflate reliability, but not validity, as the same information source is being used.

4 The case of THE PEOPLE, Plaintiff and Respondent, v. KURT ADRIAN PARKER, a Sexually Violent Predator Act case, highlights the variability that can emerge in some cases; five accredited experts furnished five PCL-R scores that ranged from 10 to 25 (Edens, John, Personal Communication, 22 May 2006).

Method

Monte Carlo studies allow the investigation of the properties of distributions and estimates of parameters where results cannot be derived theoretically (Mooney, 1997; Robert, 2004). Large numbers of simulated datasets can be created based on an explicit and replicable data-generation process. The effect of known features designed into the data, such as levels of inter-rater reliability, on outcomes, such as diagnostic precision, can be assessed. Multiple trials of procedures are carried out to allow precise estimation of outcomes. Mooney (1997) argued that Monte Carlo simulations could allow social scientists to test classical parametric inference methods and provide more accurate statistical models. In our view, this mainstream statistical technique is underused in forensic research.

Materials

We used Monte Carlo techniques based on distribution information from two datasets of PCL-R total scores: (1) data for North American Male Offenders (Table 9.1, Hare, 2003) and (2) data from UK prisoners (Cooke, Michie, Hart, & Clark, 2005).⁵ The first distribution, being the largest, probably provides the best estimate of the true distribution of scores underlying the PCL-R and is described as "approximately normal" (Hare, 2003, p. 55). Given the potential impact of a departure from normality, we tested whether this distribution was in fact normal. The departure from normality was highly significant (Kolmogorov–Smirnov = .068, df = 5408, p < .0001, Skewness = -.33, Kurtosis = -.570). Examination of Fig. 1 demonstrates that, around the standard cut-off of 30, cases are over-represented, while in the right tail of the distribution they are under-represented. In the simulation study, we generated two random variables per case using MATHCAD.13 (2005).
These random variables were scaled according to one of the two datasets referred to above with mean μ and standard deviation σ.⁶ This gives two uncorrelated ratings (x1 and x2) from the distribution of scores: x1 is our first rating on the subject, PCL1. We then calculated a linear combination of the two ratings to provide a second rating on the same subject, which has a correlation of ρ with the first rating. The linear combination is

$$\mathrm{PCL}_2 = \operatorname{round}\left\{ \mu + (x_1 - \mu)\,\rho + (x_2 - \mu)\sqrt{1 - \rho^2} \right\},$$

using rounding to ensure an integer score. This process gives two random, correlated scores from the distribution. There is a very small probability of obtaining second ratings less than 0 or greater than 40: These scores have been taken as 0 or 40, respectively.

Assuming that the ICC1 represents the best estimate of the correlation between the two scores, we estimated the distributions for four values of reliability, i.e., ICC1 values of .75, .80, .85, and .90. The .80 value is a lower bound estimate for reasonable practice. Hare (1998) indicated that at least this level should be achievable with "…properly conducted assessments" (p. 107). The .85 level may be achievable by one rater with good training; the .90 level, perhaps the best case scenario, is the level achievable where two independent sets of ratings are averaged. Values above .90 are rarely if ever achievable; Hare (1998) describes .95 and higher as "unbelievably high" (p. 107). The .75 provides a lower-bound estimate of what may be obtained in clinical practice. These values probably represent optimistic estimates for actual clinical practice; we did not assess the "worst case" scenarios implied by Edens and Petrila (2006) and by Murrie et al. (2008).
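The data-generation step described above can be sketched in a few lines. This is not the authors' Mathcad code: for simplicity the sketch draws the two uncorrelated ratings from a normal distribution, whereas the study scaled them to the empirical score distributions, and the mean of 22 and standard deviation of 8 are placeholder values chosen only for illustration. The first rating is also rounded and clamped here, which is an added assumption.

```python
import math
import random


def clamp_score(value):
    """Round to an integer score and keep it within the 0-40 range."""
    return min(40, max(0, round(value)))


def simulate_pair(rng, mean, sd, rho):
    """Return (PCL1, PCL2): two ratings of one case with correlation rho."""
    x1 = rng.gauss(mean, sd)  # first uncorrelated rating
    x2 = rng.gauss(mean, sd)  # second uncorrelated rating
    raw2 = mean + (x1 - mean) * rho + (x2 - mean) * math.sqrt(1.0 - rho ** 2)
    return clamp_score(x1), clamp_score(raw2)


def pearson(pairs):
    """Pearson correlation between the two ratings."""
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    return cov / math.sqrt(var_a * var_b)


rng = random.Random(2009)
# Placeholder mean and SD (assumptions, not values taken from this article).
pairs = [simulate_pair(rng, mean=22.0, sd=8.0, rho=0.80) for _ in range(20000)]

r = pearson(pairs)
disagree = sum((a >= 30) != (b >= 30) for a, b in pairs) / len(pairs)
print(f"correlation between ratings: {r:.2f}")
print(f"share of cases with different diagnoses at cut-off 30: {disagree:.3f}")
```

Because the normal distribution stands in for the empirical one, the disagreement rate printed by this sketch should not be expected to reproduce the tabulated results; it only shows the mechanics of generating correlated ratings and counting diagnostic disagreements.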
The estimation procedure was repeated 10,000,000 times for each of the four levels of ICC1 to provide stable estimates of the distribution of the correlated ratings and to ensure at least 10,000 cases within each of the extreme score bands. We examine discrepancies in two ways: first, in terms of diagnostic disagreement and second, in terms of disagreements about total scores.

What is the level of diagnostic agreement? Kappa (κ) coefficients measure the proportion of diagnostic agreements corrected for observed base rates (Fleiss, 1981). Conventionally, κ > .75 represents excellent agreement, .40 < κ < .75 represents fair to good agreement, and κ < .40 represents poor agreement (Gail & Benichou, 2000). Kappa values for three distributions and the four ICC1 values are given in Table 1. We calculated Kappa coefficients for agreement in diagnosis between the two ratings using both common cut-offs, i.e., 30 and 25. The vast majority of Kappa values are only in the fair to good range; few values approach the poor range. Kappa is an omnibus statistic, which is useful for summarizing group results; however, it tells us little about agreement in the individual case.

Fig. 1 Distribution of North American male prisoners and normal curve (horizontal axis: PCL-R score, 0 to 40; vertical axis: frequency, 0 to 300)

5 We carried out a similar analysis of data for Male Forensic Psychiatric Patients (Table 9.2, Hare, 2003); the results, which demonstrate the same pattern, can be obtained from the first author.

6 A full description of the simulation study including the Mathcad code can be obtained from the first author.

The potential for misclassification is clearer when distributions of disagreements are considered. The distributions based on the North American Male Offenders are in Table 2. For ease of interpretation, we tabulated the distributions in 5-point ranges.
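The Kappa coefficients reported in Table 1 can be recomputed from the agreement percentages in the same table. The sketch below is a minimal illustration, not the authors' code; it assumes that disagreements are split evenly between the two raters, so that both raters share the same base rate. The percentages are the Table 1 entries for North American male offenders at the cut-off of 30.

```python
def kappa_from_agreement(both_below, both_above, different):
    """Cohen's kappa for a two-category diagnosis, given the percentages of
    cases where both raters fall below the cut-off, both fall at or above
    it, or the raters disagree. Assumes symmetric disagreements."""
    total = both_below + both_above + different
    p_observed = (both_below + both_above) / total
    p_above = (both_above + different / 2.0) / total  # each rater's base rate
    p_expected = p_above ** 2 + (1.0 - p_above) ** 2  # chance agreement
    return (p_observed - p_expected) / (1.0 - p_expected)


# correlation: (both < 30, both >= 30, different, reported kappa), from Table 1
table1_rows = {
    0.75: (72.9, 10.6, 16.4, 0.46),
    0.80: (73.3, 11.5, 15.1, 0.51),
    0.85: (74.4, 12.2, 13.5, 0.56),
    0.90: (75.5, 13.4, 11.1, 0.64),
}

for rho, (below, above, diff, reported) in table1_rows.items():
    kappa = kappa_from_agreement(below, above, diff)
    print(f"rho = {rho:.2f}: kappa = {kappa:.2f} (reported {reported:.2f})")
```

The recomputed values agree with the tabulated ones to two decimal places, which also shows how a high proportion of raw agreement (about 85%) can translate into only fair to good Kappa when the diagnosis has a low base rate.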
Examination of the sub-table for ICC1 = .80 indicates that if one rater gives a score between 30 and 34, i.e., just above the diagnostic cut-off, then only on 46% of occasions (approximately half the time) will the other rater obtain a score within the same range. On 44% of the occasions, the second rater would place the individual below the critical cut-off. Even in the best case scenario, i.e., ICC1 = .90, if one rater gives a score between 30 and 34 then only on 60% of occasions will the other rater obtain a score within the same range. On 29% of occasions, the second rater would place the participant below the critical cut-off.

The distributions based on the UK prisoners are in Table 3. Examination of the table for ICC1 = .80 indicates that if one rater gives a score between 30 and 34, i.e., just above the diagnostic cut-off, then only on 39% of occasions will the second rater obtain a score within the same range. In 54% of the cases, the second rater would place the individual below the critical cut-off. As previously, even in the best case scenario, i.e., ICC1 = .90, if one rater gives a score between 30 and 34 then only in 53% of cases will the other rater obtain a score within the same range. In 39% of cases, the second rater would place the participant below the critical cut-off.

In the UK, the cut-off of 25, as well as 30, is often applied (DSPD Programme, 2005; Hare, 2003). Examination of the table for ICC1 = .80 indicates that if one rater gives a score between 25 and 29, i.e., just above the UK diagnostic cut-off, then only on 29% of occasions will the other rater obtain a score within the same range. On 49% of occasions, the second rater would place the individual below the critical cut-off. Even in the best case scenario, i.e., ICC1 = .90, if one rater gives a score between 25 and 29 then only in 37% of cases will the other rater obtain a score within the same range.
On 37% of occasions, the second rater would place the participant below the critical cut-off. Therefore, in broad terms, all of the findings reported above demonstrate that the allocation of an individual above or below diagnostic cut-offs is much less precise than previously thought.

Another way of considering the precision of PCL-R scores is to examine expected discrepancies in scores based on variations in ICC1 while taking into account the distributional characteristics of the PCL-R scores. The PCL-R manual suggests that in 68% of cases the discrepancies between two raters should be up to 3 points, and in 95% of cases they should be up to 6 points (Hare, 2003). This assumes normality of the PCL-R score distribution, an assumption that is not met (see above). The cumulative distributions of score discrepancies estimated from the Monte Carlo studies are tabulated in Table 4. With the North American prisoner sample and an ICC1 of .80, a discrepancy of between 8 and 9 points would be expected in 9% of cases, around 10 points in 5% of cases, and between 12 and 13 points in 1% of cases. With the UK prisoner sample, and an ICC1 of .80, a discrepancy of between 8 and 9 points would be expected in about 10% of cases, around 10 points in 5% of cases, and around 12 points in 1% of cases.

An alternative approach to summarizing the range of possible discrepancies is to estimate the distribution of a 2nd PCL-R rating given the 1st PCL-R rating. This conditional distribution can be summarized by a CI that contains 95% of the 2nd ratings. This interval is thus defined by the lower and upper limits LL and UL given by

$$\Pr(LL < \text{2nd rating} < UL \mid \text{1st rating}) = 0.95.$$

Results for both 68% and 95% CIs for ICC1 = .80, and for both samples, are presented in Table 5. For example, in the North American prisoner sample, if rater one obtains a total score of 30, then the 95% CI for rater two’s total score will be between 19 and 36 (i.e., between the 35th and 99th percentile).
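For comparison, the same conditional interval has a closed form if the two ratings are assumed to be bivariate normal. The sketch below shows that textbook calculation only; the simulation in this study deliberately avoids the normality assumption, and the mean of 22 and standard deviation of 8 are placeholder values chosen for illustration rather than figures taken from this article.

```python
import math
from statistics import NormalDist


def second_rating_interval(first, mean, sd, rho, level=0.95):
    """Interval for a second rating given the first, under an assumed
    bivariate normal model with a common mean and SD and correlation rho."""
    cond_mean = mean + rho * (first - mean)      # regression toward the mean
    cond_sd = sd * math.sqrt(1.0 - rho ** 2)     # spread left after rating 1
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    lower = max(0.0, cond_mean - z * cond_sd)    # keep within the 0-40 scale
    upper = min(40.0, cond_mean + z * cond_sd)
    return lower, upper


# Placeholder mean and SD (assumptions); rho = .80 as in Table 5.
for level in (0.68, 0.95):
    low, high = second_rating_interval(30, mean=22.0, sd=8.0, rho=0.80, level=level)
    print(f"{int(level * 100)}% interval for the 2nd rating: {low:.1f} to {high:.1f}")
```

Two features of the formula are worth noting: the interval is centred below a first rating of 30 because of regression toward the mean, and its width depends on the correlation between raters rather than on the first rating itself.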
All the estimates in this study are conservative; that is, they assume that the SEM that applies at the mean applies around the cut-off. However, this is an unwarranted assumption. The overall variance of errors of measurement is a weighted average of the errors that pertain across the range of true score values. Precision of measurement of the PCL-R drops as scores approach the diagnostic cut-off (e.g., Cooke & Michie, 1997; Cooke et al., 2006). Thus, the degree of diagnostic misclassification and score discrepancy is likely to be greater in practice than demonstrated in the simulation above. The conditional SEM (CSEM)⁷ is the square root of the variance of errors at a particular level of true scores. To evaluate the true level of agreement of diagnosis likely to apply around a cut-off, it is necessary to take the CSEM into account.

Table 1 Kappa coefficients and levels of agreement for four levels of correlation (ρ) for two distributions

ρ      Both < 30   Both ≥ 30   Different   κ      Both < 25   Both ≥ 25   Different   κ
North American male offenders
0.75   72.9        10.6        16.4        .46    48.0        29.6        22.4        .54
0.80   73.3        11.5        15.1        .51    48.9        31.0        20.0        .59
0.85   74.4        12.2        13.5        .56    50.5        32.3        17.2        .64
0.90   75.5        13.4        11.1        .64    51.7        34.2        14.1        .71
United Kingdom prisoners
0.75   91.8        2.1         6.1         .38    79.8        7.6         12.6        .47
0.80   92.0        2.4         5.6         .43    83.1        6.8         10.2        .52
0.85   92.4        2.7         5.0         .49    81.2        9.1         9.7         .60
0.90   92.7        3.1         4.2         .57    82.0        10.1        7.9         .67

Table 2 Distribution of diagnostic disagreements by four levels of correlation between raters based on distribution of North American male offenders

PCL-R score   0–4    5–9    10–14  15–19  20–24  25–29  30–34  35–40
ρ = 0.75
0–4           .209   .086   .033   .006   0      0      0      0
5–9           .505   .305   .200   .079   .008   0      0      0
10–14         .245   .318   .250   .188   .072   .005   0      0
15–19         .041   .235   .266   .270   .239   .078   .002   0
20–24         0      .053   .192   .256   .310   .281   .097   0
25–29         0      .002   .056   .155   .236   .369   .371   .130
30–34         0      0      .003   .044   .121   .230   .445   .551
35–40         0      0      0      .001   .014   .037   .085   .319
ρ = 0.80
0–4           .245   .096   .032   .002   0      0      0      0
5–9           .523   .372   .211   .065   .003   0      0      0
10–14         .217   .315   .288   .198   .055   .001   0      0
15–19         .014   .198   .280   .306   .240   .055   0      0
20–24         0      .019   .162   .269   .340   .283   .063   0
25–29         0      0      .026   .137   .247   .391   .379   .076
30–34         0      0      0      .023   .106   .237   .465   .585
35–40         0      0      0      0      .010   .031   .093   .339
ρ = 0.85
0–4           .285   .105   .021   0      0      0      0      0
5–9           .552   .423   .217   .038   0      0      0      0
10–14         .158   .328   .333   .202   .028   0      0      0
15–19         .005   .141   .303   .354   .231   .024   0      0
20–24         0      .003   .121   .287   .386   .265   .029   0
25–29         0      0      .005   .111   .266   .430   .351   .038
30–34         0      0      0      .008   .085   .252   .519   .569
35–40         0      0      0      0      .003   .029   .101   .393
ρ = 0.90
0–4           .361   .103   .009   0      0      0      0      0
5–9           .578   .486   .215   .012   0      0      0      0
10–14         .061   .348   .390   .190   .010   0      0      0
15–19         0      .063   .319   .422   .216   .005   0      0
20–24         0      0      .067   .304   .449   .238   .004   0
25–29         0      0      0      .071   .267   .500   .289   .006
30–34         0      0      0      0      .056   .239   .609   .494
35–40         0      0      0      0      0      .018   .098   .501

The tables show column percentages, which sum to 1 within rounding error. The rows therefore do not sum to 1.

Table 3 Distribution of diagnostic disagreements by four levels of correlation between raters based on distribution of UK prisoners

PCL-R score   0–4    5–9    10–14  15–19  20–24  25–29  30–34  35–40
ρ = 0.75
0–4           .395   .158   .064   .020   .002   0      0      0
5–9           .477   .352   .228   .107   .035   .002   0      0
10–14         .128   .332   .309   .219   .131   .032   0      0
15–19         0      .153   .271   .336   .265   .198   .041   0
20–24         0      .005   .119   .215   .316   .297   .266   .031
25–29         0      0      .009   .086   .162   .261   .291   .254
30–34         0      0      0      .017   .083   .191   .328   .529
35–40         0      0      0      0      .006   .020   .075   .186
ρ = 0.80
0–4           .438   .163   .059   .011   0      0      0      0
5–9           .485   .387   .232   .097   .014   0      0      0
10–14         .077   .354   .328   .229   .111   .016   0      0
15–19         0      .095   .296   .354   .289   .154   .023   0
20–24         0      0      .083   .237   .331   .321   .214   .007
25–29         0      0      .002   .067   .176   .287   .303   .224
30–34         0      0      0      .005   .077   .200   .387   .545
35–40         0      0      0      0      .002   .022   .073   .224
ρ = 0.85
0–4           .470   .168   .042   .003   0      0      0      0
5–9           .474   .437   .224   .069   .003   0      0      0
10–14         .056   .342   .375   .226   .075   .003   0      0
15–19         0      .053   .308   .404   .287   .106   .001   0
20–24         0      0      .050   .251   .383   .325   .140   0
25–29         0      0      0      .047   .194   .321   .328   .145
30–34         0      0      0      0      .058   .229   .447   .578
35–40         0      0      0      0      0      .017   .084   .277
ρ = 0.90
0–4           .530   .169   .020   0      0      0      0      0
5–9           .450   .501   .219   .033   0      0      0      0
10–14         .019   .312   .457   .207   .037   0      0      0
15–19         0      .018   .286   .489   .275   .050   0      0
20–24         0      0      .017   .252   .450   .318   .062   0
25–29         0      0      0      .018   .214   .367   .325   .046
30–34         0      0      0      0      .024   .257   .528   .587
35–40         0      0      0      0      0      .008   .085   .367

7 Professional standards indicate that the CSEM is an important piece of information that should be provided in a test manual. For example, Standard 2.14: "Conditional standard error of measurements should be reported at several score levels if constancy cannot be assumed. Where cut scores are specified for selection or classification, the standard errors of measurement should be reported in the vicinity of each cut score." (American Educational Research Association/American Psychological Association, 1999; p. 35 emphasis added).

Item Response Theory indicates that the error of measurement varies with location on the trait (θ). IRT gives

$$SE(\theta) = \frac{1}{\sqrt{I(\theta)}}$$

where I(θ) is the information at θ. Classical test theory (CTT) gives

$$SEM = SD\sqrt{1 - \rho}$$

Let ρ1 be the correlation at location 1 (θ1) and ρ2 be the correlation at location 2 (θ2). Then

$$\rho_2 = 1 - (1 - \rho_1)\,\frac{I(\theta_1)}{I(\theta_2)}$$

Location 1 is θ = 0.0 (PCL-R = 20) and Location 2 is θ = 1.0 (PCL-R = 30) (approximate locations from Hare, 2003; Fig. 6.6; see also Cooke & Michie, 1997). Overall, the impact of the location of the estimated ICC1 is limited, dropping, at a maximum, from .75 to .69. However, as noted above, even small drops in ICC1 (e.g., from .85 to .80) can substantially affect the misclassification rate and the range of likely score discrepancies (see Table 6). It is noteworthy that the magnitude of the drop appears to be proportionately larger the poorer the mean estimated level of inter-rater reliability. This suggests that the effect of the CSEM is larger in cases that start with a relatively poor level of inter-rater reliability. Equally, this would suggest that proportionately greater discrepancies would, in general, be obtained when factor or facet scores are considered because they have lower levels of reliability than the total scores (Hare, 2003).

STUDY TWO

The use of the PCL-R in court is frequently justified based on its predictive utility, the support being garnered from between-subject designs (Edens & Petrila, 2006; Hare, 2003; Walsh & Walsh, 2006). In this study, we are concerned with the individual. We examine the confidence that can be placed in a prediction that an individual with a particular PCL-R score will be reconvicted for a violent offence. All measurements and estimates entail error.
As noted above, the degree of error is expressed by CIs. For example, while the mean rate of reoffending for a "High Risk" group may be estimated as being 55%, the 95% CI indicates that the true value of the mean rate of reoffending for this group will lie between 44% and 66%, 95% of the time, i.e., 19 times out of 20 (Hart, Michie, & Cooke, 2007). However, the clinician and the decision maker are interested in the individual case, not the group. Therefore, how much confidence can the clinician and decision maker have in predictions of reoffending in the individual case based on PCL-R scores? We examine CIs for group and individual predictions.

Table 4 Cumulative distribution of expected discrepancies between two raters for different levels of correlation based on two sample distributions

                        North American male offenders   United Kingdom prisoners
Point                   Correlation                     Correlation
discrepancy   SEM(a)    0.75   0.80   0.85   0.90       0.75   0.80   0.85   0.90
0             1.00      1.00   1.00   1.00   1.00       1.00   1.00   1.00   1.00
1             .741      .934   .927   .919   .901       .936   .928   .918   .900
2             .503      .804   .785   .758   .704       .804   .785   .758   .703
3             .317      .679   .647   .595   .518       .681   .647   .595   .518
4             .182      .560   .516   .453   .352       .561   .518   .454   .357
5             .095      .449   .397   .327   .217       .452   .400   .332   .222
6             .046      .349   .292   .215   .121       .353   .298   .221   .125
7             .020      .262   .205   .136   .058       .269   .212   .141   .063
8             .007      .190   .137   .079   .023       .196   .142   .083   .026
9             .002      .132   .086   .041   .007       .137   .090   .044   .009
10            .001      .088   .050   .018   .002       .091   .054   .021   .003
11                      .055   .026   .007              .058   .030   .009   .001
12                      .032   .012   .002              .035   .015   .004
13                      .017   .005                     .019   .007   .001
14                      .008   .002                     .010   .003
15                      .003                            .005   .001
16                      .001                            .002
17                                                      .001

(a) This column shows the cumulative distribution of discrepancies, which was calculated assuming that discrepancies between two raters are normally distributed and that the SEM is 3 (Hare, 2003, pp. 66–67).
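The group-level interval in the example above (a 55% rate with a 95% CI of 44% to 66%) can be approximated with the usual normal-approximation formula for a proportion. The sketch below is illustrative only: the group size behind the quoted interval is not given in this article, so the value n = 79 is a hypothetical choice that happens to produce a similar interval, and the cited source may have used a different interval method.

```python
import math


def proportion_ci(rate, n, z=1.96):
    """Normal-approximation (Wald) CI for a group proportion."""
    se = math.sqrt(rate * (1.0 - rate) / n)  # standard error of the rate
    return max(0.0, rate - z * se), min(1.0, rate + z * se)


# Hypothetical group sizes; n = 79 is an assumption chosen for illustration.
for n in (20, 79, 500):
    low, high = proportion_ci(0.55, n)
    print(f"n = {n:3d}: 95% CI {low:.2f} to {high:.2f}")
```

The interval narrows as the group grows because it describes uncertainty about the group's mean rate; it says nothing about which members of the group will reoffend.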
Participants

Two hundred fifty-five male prisoners between 18 and 40 years of age (M = 26.8, SD = 5.9) were interviewed in Scotland’s largest prison for a study of psychological characteristics and violence (Cooke, Michie, & Ryan, 2001; Michie & Cooke, 2006). Prisoners were selected by systematic random sampling of the prison. The average sentence length was 39 months (SD = 23 months; range = 3 months to 10 years and life).

PCL-R Ratings

PCL-R ratings were made according to instructions in the test manual (Hare, 1991). All PCL-R evaluations were conducted by trained raters using both interview and file review (ICC1 = .86).

Assessment of Recidivism

Reconviction data were obtained from two sources: The Scottish Criminal Records Office (SCRO) and the Police National Computer (PNC). The average follow-up period was 29 months. The point-biserial correlation between PCL-R scores and recidivism (r = .31) was above average for the field (Walters, 2003). For the purposes of illustration, we consider reconviction for violence that resulted in a prison sentence (i.e., generally a more serious violent offence). Follow-up data were available for 190 cases and PCL-R data for 184 of these.
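Table 6 reports the reliability implied at the diagnostic cut-off of 30. The relation given in Study One, $\rho_2 = 1 - (1 - \rho_1)\, I(\theta_1)/I(\theta_2)$, can be evaluated directly. The sketch below is illustrative only: the article does not report the information values, so the ratio of 1.2 is a hypothetical figure, chosen because it happens to reproduce the North American row of Table 6.

```python
def reliability_at_location(rho_1: float, info_ratio: float) -> float:
    """rho_2 = 1 - (1 - rho_1) * I(theta_1) / I(theta_2).

    info_ratio stands for I(theta_1) / I(theta_2), the test information at
    location 1 relative to location 2.
    """
    return 1.0 - (1.0 - rho_1) * info_ratio


# Hypothetical ratio, NOT taken from the article.
ASSUMED_INFO_RATIO = 1.2

adjusted = {
    rho: round(reliability_at_location(rho, ASSUMED_INFO_RATIO), 2)
    for rho in (0.75, 0.80, 0.85, 0.90)
}

for rho, rho_cut in adjusted.items():
    print(f"rho_1 = {rho:.2f} -> rho_2 = {rho_cut:.2f}")
```

Because the reduction is proportional to $1 - \rho_1$, the absolute drop is larger when the starting reliability is poorer, which is the pattern noted in the text.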
Table 5 The 68% and 95% confidence intervals for 2nd PCL-R total score given 1st PCL-R score and ICC = 0.8

1st      Prisoners                          UK
PCL-R    LL .95  LL .68  UL .68  UL .95     LL .95  LL .68  UL .68  UL .95
0        0       0       9       12         0       0       8       13
1        0       0       10      13         0       0       9       14
2        0       1       11      14         0       0       10      15
3        0       2       12      15         0       1       11      16
4        0       3       12      15         0       1       12      16
5        0       4       13      16         0       2       12      17
6        0       4       14      17         0       3       13      18
7        0       5       15      18         1       4       14      19
8        1       6       16      19         2       5       15      20
9        2       7       16      19         2       5       16      20
10       3       8       17      20         3       6       16      21
11       4       8       18      21         4       7       17      22
12       4       9       19      22         5       8       18      23
13       5       10      20      23         6       9       19      24
14       6       11      20      23         6       9       20      24
15       7       12      21      24         7       10      21      25
16       8       12      22      25         8       11      21      26
17       8       13      23      26         9       12      22      27
18       9       14      24      27         10      13      23      28
19       10      15      24      27         10      13      24      28
20       11      16      25      28         11      14      24      29
21       12      16      26      29         12      15      25      30
22       12      17      27      30         13      16      26      31
23       13      18      28      31         14      17      27      32
24       14      19      28      31         14      17      28      32
25       15      20      29      32         15      18      28      33
26       16      20      30      33         16      19      29      34
27       16      21      31      34         17      20      30      35
28       17      22      32      35         18      21      31      36
29       18      23      32      35         18      21      32      36
30       19      24      33      36         19      22      32      37
31       20      24      34      37         20      23      33      38
32       20      25      35      38         21      24      34      39
33       21      26      36      38         22      25      35      40
34       22      27      36      39         22      25      36      40
35       23      28      37      40         -       -       -       -
36       24      28      38      40         24      27      37      40
37       24      29      39      40         25      28      38      40
38       25      30      40      40         26      29      39      40
39       26      31      40      40         -       -       -       -
40       27      32      40      40         -       -       -       -

Table 6 Values of conditional standard error of measurement at diagnostic cut-off of 30 for different values of SEM and distributions of the three samples

                                 SEM ρ1    CSEM ρ2
North American male prisoners    0.75      0.70
                                 0.80      0.76
                                 0.85      0.82
                                 0.90      0.88
United Kingdom prisoners         0.75      0.67
                                 0.80      0.74
                                 0.85      0.80
                                 0.90      0.87

Analysis

There are standard methods for estimating CIs for groups; however, methods for estimating CIs for the individual case are not generally covered in the standard statistical texts used in psychology and they may, we suspect, be unfamiliar to the majority of psychologists. We explicate the method here. First, we consider the general case of CI estimation before considering the specific approach based on linear logistic regression used for our analysis.
Any CI has the general form:

$$\text{Estimate} \pm t \times (\text{Estimate of standard error})$$

where t is the Student’s t-statistic with the appropriate degrees of freedom. Suppose we are interested in a single variable, e.g., x = IQ, and have taken a sample of size n ($x_1, x_2, \ldots, x_n$) to estimate the mean and variance of IQ in the population of interest; then the sample mean ($\bar{x}$) is the estimate of the population mean. The accuracy of this estimate is given by a CI

$$\bar{x} \pm t_n \sqrt{\frac{s^2}{n}}$$

where s is an estimate of the standard deviation of x. Suppose we are now interested in predicting the next observation in the population, $x_{n+1}$. Then a CI for the prediction (i.e., the prediction interval) is given by

$$\bar{x} \pm t_n \sqrt{s^2\left(1 + \frac{1}{n}\right)}$$

Note that the estimate of both the mean and the prediction is $\bar{x}$ but that the prediction interval is (much) wider than the CI for the mean. Note also that the size of the sample from which the model was derived has little influence on the width of the prediction interval.

In the linear regression situation, we have a sample of n pairs of observations ($(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$) from which we estimate the intercept and slope of the line by $B_0$ and $B_1$ in the usual way. The accuracy of estimation of the line would be given by the CI for the mean y for a given x. This is calculated in the standard manner (Steel, Torrie, & Dickey, 1997).

$$y_L, y_U = B_0 + B_1 x \pm t_n \sqrt{\hat{\sigma}^2\left(\frac{1}{n} + \frac{(x - \bar{x})^2}{SS(X)}\right)}$$

If we have a new case for which we know the x-value, $x_{n+1}$, and wish to predict the y-value, this is given by

$$\hat{y}_{n+1} = B_0 + B_1 x_{n+1}$$

which is the mean value of y for the given x.
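The single-variable contrast between the CI for the mean and the prediction interval for the next observation can be illustrated numerically. The following is a minimal sketch under stated assumptions: the IQ values are simulated rather than real data, the function name is illustrative, and a normal quantile is used as a large-sample stand-in for the Student’s t-statistic.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def mean_ci_and_prediction_interval(sample, level=0.95):
    """Return (CI for the mean, prediction interval for the next observation)."""
    n = len(sample)
    centre = mean(sample)
    s = stdev(sample)  # sample standard deviation (n - 1 denominator)
    # Assumption: normal quantile instead of Student's t (adequate for large n).
    q = NormalDist().inv_cdf(0.5 + level / 2.0)
    ci_half = q * math.sqrt(s ** 2 / n)
    pi_half = q * math.sqrt(s ** 2 * (1.0 + 1.0 / n))
    return (centre - ci_half, centre + ci_half), (centre - pi_half, centre + pi_half)


rng = random.Random(0)
iq_sample = [rng.gauss(100, 15) for _ in range(400)]  # hypothetical IQ scores

ci, pi = mean_ci_and_prediction_interval(iq_sample)
print(f"CI for the mean:     {ci[0]:.1f} to {ci[1]:.1f}")
print(f"Prediction interval: {pi[0]:.1f} to {pi[1]:.1f}")
```

Both intervals share the same centre, but the ratio of their widths is $\sqrt{n + 1}$, so with 400 observations the prediction interval is about 20 times wider than the CI for the mean.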
The CI for this prediction (i.e., the prediction interval) is given by

$$y_L, y_U = B_0 + B_1 x_{n+1} \pm t_n \sqrt{\hat{\sigma}^2\left(1 + \frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{SS(X)}\right)}$$

Again this prediction interval is much wider than the interval for the line and is not influenced to any significant degree by the size of the sample from which the model was developed. The square root term is the standard error of the predicted value. Here, the expression in brackets takes into account three sources of error. The first is the variability in participants, the second is the error in the estimate of variance ($\hat{\sigma}^2$), and the third allows for the fact that the error in prediction varies with distance from the mean PCL-R score.

Linear logistic regression is the appropriate method for modeling the prediction of a binary outcome (e.g., reconviction). In linear logistic regression the model is given by

$$\Pr(\text{event}) = \frac{1}{1 + e^{-Z}} \quad \text{where} \quad Z = B_0 + B_1 x$$

We have a linear regression of Z on x so the equation for the CI for Z is the same as the linear regression case. A prediction interval for Z for a new individual from the same population with score $x_0$ can be constructed by $Z_L$ and $Z_U$ (lower and upper values, respectively) from the equation

$$Z_L, Z_U = B_0 + B_1 x_0 \pm t_n \sqrt{\hat{\sigma}^2\left(1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS(X)}\right)}$$

and the interval is then transformed in the following manner. Since the probability is a monotonic function, the prediction interval for the probability is given by

$$\left(\frac{1}{1 + e^{-Z_L}},\ \frac{1}{1 + e^{-Z_U}}\right)$$

(note, once again, that the size of sample on which the model was developed has little influence on the width of the prediction interval).

Initially, we estimated the linear regression of Z on the PCL-R total score (Fig.
2), and then estimated the mean probability of reconviction by PCL-R score (Fig. 3). This is a monotonic function with the probability of reconviction accelerating with increasing PCL-R score. Those with an average PCL-R score (i.e., 12.5) had a 14% probability of being reconvicted for a violent crime and being sentenced to prison. Examination of the 95% CIs for the estimate of the mean rate of reconviction indicated that for an average PCL-R score of 12.5 the true probability of reconviction was between 10% and 20% (19 times out of 20) and for a PCL-R score of 25 the 95% CI was 18–54%. For a score of 30, the 95% CI was 21–70%, demonstrating that 95% CIs generally widen the further scores are from the mean.

Fundamentally, however, within the clinical or judicial context, the individual, not the group, is the focus of decision making. Therefore, we estimated the CIs for the likelihood that an individual would be reconvicted using the method outlined above. For an individual with a mean PCL-R score of 12.5, the best estimate was that he will reoffend 14% of the time; but this estimate was very imprecise because the 95% CI was 0–98%, i.e., the true value of the prediction would lie in this range 95% of the time. For an individual with a score of 25, the 95% CI was 0–99% and for an individual with a score of 30 the 95% CI was 0–99.5%.

To illustrate the extent of the uncertainty associated with an individual prediction, we calculated the probability density function associated with a prediction that an individual with a PCL-R score of 25 would return to prison within 2 years for a violent crime (point estimate .33). The probability function is a means of describing degree of uncertainty. It can be viewed as a smoothed version of a histogram depicting relative frequencies of the range of probabilities of reoffending consequent on the variability in the original sample.
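The transformation from an interval on Z to an interval on the probability scale, described in the Analysis section, can be sketched as follows. This is an illustration, not the authors’ computation: the article does not report the half-widths on the Z scale, so the two values used here (0.4 and 5.7) are hypothetical, chosen so that the output lands near the reported intervals for a PCL-R score of 12.5. The model is taken to be $\Pr(\text{event}) = 1/(1 + e^{-Z})$.

```python
import math


def logistic(z: float) -> float:
    """Pr(event) = 1 / (1 + exp(-Z))."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p: float) -> float:
    """Inverse of the logistic function."""
    return math.log(p / (1.0 - p))


def probability_interval(z_hat: float, half_width: float):
    """Map the interval z_hat +/- half_width on Z to the probability scale."""
    return logistic(z_hat - half_width), logistic(z_hat + half_width)


# Point estimate reported in the text for a PCL-R score of 12.5.
z_hat = logit(0.14)

# Hypothetical half-widths on the Z scale (not reported in the article).
narrow = probability_interval(z_hat, 0.4)  # stands in for the CI of the mean
wide = probability_interval(z_hat, 5.7)    # stands in for the prediction interval

print(f"narrow: {narrow[0]:.2f} to {narrow[1]:.2f}")
print(f"wide:   {wide[0]:.3f} to {wide[1]:.3f}")
```

Because the logistic function is monotonic, the end points of the Z interval map directly to the end points of the probability interval, and a symmetric interval on Z becomes asymmetric around the point estimate of .14.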
Figure 4 displays a relatively flat probability density function around the point estimate of .33, with values ranging from 0 to 1.0, indicating that a broad range of values is likely in any individual case.

One anonymous reviewer made the compelling case that a more liberal definition of harmful behavior that included more forms of offending should be considered. We carried out the same analyses with three other outcome variables, i.e., convictions for any crime or offence within 2 years of release; any convictions leading to incarceration; and any conviction for a violent crime over the same time period. Figure 5 displays results for any convictions within 2 years of release and reveals the same pattern as for violent crime: the CI associated with the regression line is much narrower than the prediction interval (see footnote 8).

Another anonymous reviewer suggested that our results might be due to sample size (this is highly unlikely given the mathematical basis for the analysis; see above) or because of where the sample was drawn. We carried out further analyses to clarify these points using data from the MacArthur study of Mental Disorder and Violence (Monahan et al., 2001). Psychopathy as measured using the Psychopathy Checklist: Screening Version (PCL:SV; Hart, Cox, & Hare, 1995) was the strongest risk factor for future violence in that study, i.e., violence in the 20 weeks following discharge; the sample size with PCL:SV ratings was over four times the Scottish sample (n = 860). The point-biserial correlation between PCL:SV scores and recidivism was similar to the equivalent correlation in the Scottish sample (r = 0.34, cf. r = 0.31). Figure 6 displays essentially the same pattern as the Scottish data: The slope is similar in shape to the Scottish curve, a slowly accelerating curve with risk of violence increasing with PCL:SV score.
As would be expected from consideration of the equations above, increasing the sample size (among other things) has resulted in a narrower CI around the regression line. But critically, as would be expected from the mathematics, increasing the sample size does not result in a narrower prediction interval.

Fig. 2 Group and individual CIs for linear regression of Z on PCL-R Total Score (x-axis: PCL Total Score; y-axis: Z; curves: Mean, Prediction)

Fig. 3 Group and individual CIs around the prediction of violent reoffending resulting in return to prison based on PCL-R score (x-axis: PCL-R total score; y-axis: Probability of recidivism; curves: Mean, Prediction)

Footnote 8. Detailed descriptions of the results for the additional three outcome variables can be obtained from the first author.

Perhaps a nonpsychological example may facilitate explanation. Given someone’s height, how well can we predict his or her weight? This example has several advantages for the purpose of illustrating the pervasive nature of the problem of predicting in the individual case. First, the reliability of the measurement of height and weight should be substantially higher than the measurement of either psychopathy or violent behavior. Second, the prediction is immediate and not degraded by the passage of time. Third, the relationship between height and weight is stronger than that between psychopathy and violent behavior. How well can we predict in the individual case under these more benign conditions? We carried out a Monte Carlo simulation (see footnote 9) based on two sets of findings: The height of males in the UK is normally distributed (Guilford, Rona, & Chinn, 1992); and the relationship between height and weight can be assumed to be linear (Hawthorne, Murdoch, & Womersley, 1979). Figure 7 presents the linear regression of weight on height with a sample of 2000.
The CI of the regression line is very narrow; however, when the prediction interval is calculated for an individual it is very wide. For example, for an individual of average height (i.e., 1.75 m) his predicted weight would be 81.5 kg but the prediction interval is between 61.3 and 101.8 kg, a range of around 40 kg.

In conclusion, the results demonstrate that PCL-R (and PCL:SV) scores provide little reliable information about the likelihood that an individual will reoffend violently (see footnote 10). This is not a problem peculiar to the PCL-R but will reflect individual variability on any scale (e.g., VRAG, Quinsey et al., 1998; Static-99, Hanson & Thornton, 1999; COVR, Monahan et al., 2005).

DISCUSSION

One broad conclusion can be drawn from these two studies: Clinicians must be extremely cautious in what they claim regarding diagnoses, numerical scores, and risk potential of individual clients based merely on a PCL-R score. First, allocation above and below key diagnostic cut-offs (i.e., 30 or 25 on the PCL-R) is subject to far greater variability than previously demonstrated. Second, the precision of numerical scores is less than previously considered. Third, the clinician can have little confidence in statistical predictions regarding an individual’s likelihood of future offending based on a PCL-R score or the scores of violence risk assessment instruments. Fourth, the concatenation of these two sources of imprecision (score and predictive) is likely to further intensify uncertainty about what any one individual will do in the future.

Fig. 4 Probability density function of the probability of a return to prison on conviction of a violent crime for a PCL-R score of 25 (x-axis: Probability of recidivism; y-axis: Probability density)

Fig. 5 Group and individual CIs around the prediction of any reoffending based on PCL-R score (x-axis: PCL-R total score; y-axis: Probability of reconviction; curves: Mean, Prediction)

Fig. 6 Group and individual CIs around the prediction of violence in the 20 weeks following discharge: Data from MacArthur Study of Mental Disorder and Violence (x-axis: PCL:SV; y-axis: Probability of violence; curves: Confidence interval for line, Confidence interval for prediction)

Footnote 9. Following Guilford et al. (1992), the height of adult males in the UK between 1973 and 1988 was shown to be normally distributed with a mean of approximately 1.75 m (SD = 0.07). The relationship between weight and height for men aged 40–59 can be shown to be linear, with the relationship for nonsmokers being Weight = 82.7 × Height − 63.4 with r = 0.50 (Hawthorne et al., 1979). A sample of 2,000 pairs of height and weight were generated using Mathcad. For each subject, a height H was generated from a N(1.75, 0.49) distribution. A predicted weight was then calculated; a weight W was generated by adding an error from a N(0, 98.4) distribution. The correlation between height and weight for this sample of 2,000 cases did not differ significantly from that reported in Hawthorne et al. (1979) (0.50 vs. 0.499). The linear regression of weight on height together with the CI for the regression line and the prediction interval for an individual whose height was known were calculated and presented in Fig. 7. Linear regression rather than logistic regression was used because both variables are continuous. The basic distinction between confidence intervals and prediction intervals remains the same.

Footnote 10. We do not consider additional issues that would add “noise” into the system, including recalibration of the PCL-R in a new jurisdiction in terms of the metric equivalence, or otherwise, of the scores (Cooke et al., 2005), the differences of reliability in clinical practice against research settings, and variations in the predictive validity of the PCL-R in a setting where detection and conviction rates may be different, etc.
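The height and weight simulation described in footnote 9 can be approximated with the Python standard library. The sketch below is not the authors’ Mathcad code and rests on several assumptions: the printed regression is read as weight = 82.7 × height − 63.4; heights are drawn with the stated SD of 0.07 m; 98.4 is treated as the error variance; the random seed is arbitrary; and a normal quantile stands in for the Student’s t-statistic.

```python
import math
import random
from statistics import NormalDist, mean

rng = random.Random(2009)  # arbitrary seed, for reproducibility only
N = 2000

# Assumptions: SD of height = 0.07 m; 98.4 is the error variance (kg^2).
heights = [rng.gauss(1.75, 0.07) for _ in range(N)]
weights = [82.7 * h - 63.4 + rng.gauss(0.0, math.sqrt(98.4)) for h in heights]

# Ordinary least squares fit of weight on height.
x_bar, y_bar = mean(heights), mean(weights)
ss_x = sum((x - x_bar) ** 2 for x in heights)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(heights, weights)) / ss_x
b0 = y_bar - b1 * x_bar
resid_var = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(heights, weights)) / (N - 2)


def intervals(x0: float, level: float = 0.95):
    """Return (fitted value, CI for the line, prediction interval) at x0."""
    q = NormalDist().inv_cdf(0.5 + level / 2.0)  # normal stand-in for t
    fit = b0 + b1 * x0
    leverage = 1.0 / N + (x0 - x_bar) ** 2 / ss_x
    ci_half = q * math.sqrt(resid_var * leverage)
    pi_half = q * math.sqrt(resid_var * (1.0 + leverage))
    return fit, (fit - ci_half, fit + ci_half), (fit - pi_half, fit + pi_half)


fit, ci, pi = intervals(1.75)
print(f"predicted weight at 1.75 m: {fit:.1f} kg")
print(f"CI for the line:            {ci[0]:.1f} to {ci[1]:.1f} kg")
print(f"prediction interval:        {pi[0]:.1f} to {pi[1]:.1f} kg")
```

Under these assumptions the CI for the line at average height spans roughly 1 kg, while the prediction interval spans roughly 40 kg, in line with the range reported in the text.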
We emphasize again that these problems are not unique to the PCL-R: The shape of underlying score distributions will influence the precision of any scores estimated or any diagnoses derived. Statistical predictions about individuals will always be poor (Hart et al., 2007). As noted above, all psychological tests used in the same way in the forensic arena may suffer from similar limitations (e.g., VRAG, Quinsey et al., 1998; Static-99, Hanson & Thornton, 1999; COVR, Monahan et al., 2005). Neither are these problems unique to psychology. They bedevil—as our height and weight example demonstrates— any attempts to use group data to predict individual outcomes accurately, whether the outcome is, for example, heart attacks, cancer, juvenile delinquency, or recidivism (Copas & Marshall, 1998; Elmore & Fletcher, 2006; Rose, 1992; Scott, 2003). The problem reflects inherent human variation. There are perhaps two broad findings to note when it comes to considering the precision of our estimates of trait strength (or indeed, diagnosis). First, the use of aggregate statistics (e.g., Kappa or ICC1) to measure agreement, or to infer precision of our measurement processes, can obscure clinically important imprecision at the level of the individual. Second, untested assumptions (e.g., that scores are normally distributed) can be misleading when it comes to estimating the precision of our estimates. The findings from the Monte Carlo study (Study 1, described above) whether expressed in terms of diagnostic agreement, score disagreement, or range of score discrepancies may provide some explanation for the growing evidence of clinically significant discrepancies in PCL-R ratings in forensic settings (Boccaccini, Turner, & Murrie, 2008; Edens & Petrila, 2006; Murrie et al., 2008; Murrie, Boccaccini, Turner, et al., in press). Ethical forensic practice requires practitioners to maximize their reliability. There are no panaceas but four steps may assist. 
The first step is ongoing education and training, not only regarding the research base of tests and measures used in forensic practice, but also regarding advanced clinical skills. Advanced clinical skills would include techniques for interviewing these challenging individuals to ensure the collection of relevant information; these skills would also include techniques for generating case formulations to ensure the appropriate application of the information collected (Cooke, 2008, 2009a; Logan & Johnstone, 2008). The second step is ensuring the availability of comprehensive file information. The quality of file information influences both the magnitude and reliability of scores (Alterman, Cacciola, & Rutherford, 1993). The third step is the use of multiple raters in high stakes cases; average ratings should be eschewed, consensus ratings should be sought. The fourth step is the implementation of audit systems, including peer review, for the detection of rater drift (Cooke, 2009b).

Deriving Inferences About Individuals from Inferences About Groups

We recognize that some of our conclusions may be surprising, perhaps even controversial, as there is a widespread acceptance of the prediction paradigm. However, should we be surprised that we find it difficult to predict what any individual will do in the future? Consider just some of the factors that affect predictive accuracy: the lack of reliability in the predictor and outcome variables; the relative weakness of the association between these variables; the inherent variability across individuals (and within individuals and their circumstances across time); and the multitudinous causes that result in violent crime. Perhaps we have become over-confident. Studies of judgment under uncertainty have indicated human tendencies both to be overconfident in predictions (Kahneman & Tversky, 1973) and overly narrow in CI estimates (Alpert & Raiffa, 1982). Professionals are not immune from these biases.
Fig. 7 Group and individual CIs around the prediction of weight from knowledge of an individual’s height: Simulation with n = 2,000 (x-axis: Height (m); y-axis: Weight (kg); curves: Predicted Weight, Mean, Prediction)

The findings we present about predictions in individual cases reflect a problem of inference that is long recognized in psychology and other disciplines more generally (e.g., Altman & Royston, 2000; Henderson & Keiding, 2005). Discussing child development, Lewin (1931; as cited in Richters, 1997) noted “An inference from the average to the particular case is … impossible.” (Richters, 1997, p. 199). Discussing the medical application of prognostic models, Altman and Royston (2000) noted “…the distinction between what is achievable at the group and individual levels is not well understood” (p. 454). The problem pertains even under ideal conditions: Henderson and Keiding (2005), discussing survival time prediction in relation to virulent non-small-cell lung cancer, indicated “…the intrinsic statistical variations in life times are so large that predictions based on statistical models and indices are of little use for individual patients. This applies even when the prognostic model is known to be true and there is no statistical uncertainty in parameter estimation” (p. 703). Why is this so?

Confidence Intervals and Prediction Intervals

As we indicated in our exegesis of the statistical principles underlying this problem, the CIs for model parameters are different from the CIs around the prediction for a new case, the latter always being substantially wider than the former. Also, prediction intervals are little influenced by the size of the sample used to develop the statistical model (Steel et al., 1997). Collecting bigger samples is not a solution. We demonstrated this empirically by contrasting the Scottish sample with the MacArthur sample.
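The limited effect of sample size can be seen from the terms under the square root in the interval formulas. The following is a minimal sketch under simplifying assumptions: the intervals are evaluated at the mean of the predictor (so the distance term vanishes), and the residual variance and the t-quantile are treated as common to both samples. The sample sizes are those given in the text (184 Scottish cases with PCL-R and follow-up data; 860 MacArthur cases); the function name is illustrative.

```python
import math


def interval_scale_factors(n: int):
    """Multipliers of sigma-hat (times the t-quantile) at the predictor mean.

    CI for the line:      sqrt(1 / n)
    Prediction interval:  sqrt(1 + 1 / n)
    """
    return math.sqrt(1.0 / n), math.sqrt(1.0 + 1.0 / n)


ci_184, pi_184 = interval_scale_factors(184)  # Scottish sample
ci_860, pi_860 = interval_scale_factors(860)  # MacArthur sample

print(f"n = 184: CI factor {ci_184:.4f}, PI factor {pi_184:.4f}")
print(f"n = 860: CI factor {ci_860:.4f}, PI factor {pi_860:.4f}")
print(f"CI width ratio (860 vs 184): {ci_860 / ci_184:.3f}")
print(f"PI width ratio (860 vs 184): {pi_860 / pi_184:.4f}")
```

On these assumptions the larger sample more than halves the width of the CI for the line, while the width of the prediction interval changes by only a fraction of one percent.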
The distinction between CIs and prediction intervals is made in other areas of assessment, e.g., intelligence testing. An example may demonstrate the pervasiveness of the prediction problem when applied to the individual. The Wechsler Abbreviated Scale of Intelligence (WASI; Psychological Corporation, 1999) is a brief test of intellectual functioning, which can be used to predict an individual’s performance on the “gold standard” Wechsler Intelligence Scale for Children–Third Edition (WISC-III; Wechsler, 1991). Note that these are very reliable tests, that they measure within the same conceptual domain (using very similar procedures), and that the correlation between the two tests is very high (Full-scale IQ r = .87). An individual assessed on the WASI with a Full Scale IQ of 70 (90% CI 66–76) will have a predicted WISC-III Full Scale IQ of 70 (90% Prediction Interval 62–87; Psychological Corporation, 1999). Thus, even in these ideal conditions (the same conceptual domain, highly reliable tests that are highly correlated) the prediction interval, at 25 points, is 2.5 times as wide as the equivalent 10-point CI. It is not surprising that the difference between the CI and the prediction interval is even greater when the link between PCL-R scores and future violence is considered. In the Scottish sample, for a mean PCL-R score the best estimate of the probability of reoffending violently is 14%, the CI is between 10% and 20%, whereas the prediction interval is between 0% and 98%. In this case, the prediction interval is almost ten times the CI.

This problem of moving from the general to the specific is not merely a matter of statistics; it is also a matter of logic (Hájek & Hall, 2002; Hart et al., 2007). The application of between-subject information to guide within-subject causal inference is subject to the logical fallacy of division (Rorer, 1990).
One form of this fallacy rests on drawing an invalid conclusion about an individual member of a group based on the collective properties of the group. For example, it is obviously fallacious to argue that if, in general, intelligent people earn more than less intelligent people then Jules, with an IQ of 120, will earn more than Jim with an IQ of 100. Equally, it is fallacious to argue that because, in general, people who score highly on the PCL-R re-offend more than people who do not score highly, Bill with a PCL-R score of 30 will re-offend more often than Brian with a PCL-R score of 10. A common defense of the actuarial approach is founded upon this fallacy: “If it is alright for life insurance companies, it should be alright for psychology.” The analogy is false. The actuary makes a profit by predicting the proportion of insured lives that will end in a particular time period: The actuary has no interest in predicting the deaths of particular individuals.

There is a growing awareness in psychology that between-subject models cannot test or support causal accounts (e.g., pertaining to earning potential or violence) that are valid at the individual level (Borsboom, Mellenbergh, & van Heerden, 2003; Richters, 1997). With a between-subjects design it is possible to argue legitimately that differences in psychopathy within a population can cause population-level differences in violent reoffending. However, this position cannot be defended at the level of the individual; this is because there is an unspoken assumption that the mechanisms that operate at the level of the individual also explain variations between individuals. Richters (1997) clarified the basis of the problem:

The extraordinary human capacity for equifinal and multifinal functioning, however, render the structural homogeneity assumption untenable.
Very similar patterns of overt functioning may be caused by qualitatively differing underlying structures both within the same individual at different points in time, and across different individuals at the same time (equifinality) (pp. 206–207).

Individuals are violent for different reasons: Any one individual may be violent for different reasons on different occasions.

In summary, on the basis of empirical findings, statistical theory, and logic it is clear that predictions of future offending cannot be achieved, with any degree of confidence, in the individual case.

CONCLUSION

We emphasize again that the problems identified in this article are not unique to the PCL-R. In some sense our ability to demonstrate these problems with the PCL-R is a reflection of the success of this test: It is used extensively and thus large datasets are available; it has been subject to considerable psychometric evaluation. Other tools used in forensic settings will be subject to similar limitations. For example, the precision with which individuals can be allocated to risk “bins” by actuarial risk tools is influenced by the reliability of scoring and the underlying distribution of scores.

Faigman (2007) argued that psychology has ignored the problem of translating scientific research into findings that help triers of fact; he indicates that psychology has to take on the “monumental intellectual challenge” (p. 313) of making the inferential leap between population-level findings and individual-level findings relevant to courts. We ignore this challenge at our peril. Tentative steps toward meeting this challenge are discussed elsewhere (Cooke, 2009b).

This article is not without limitations. First, it is based on males. We know little about the reliability of the diagnosis and predictive utility in females or, indeed, whether the instrument functions adequately in females or other populations (Forouzan & Cooke, 2005; Verona & Vitale, 2006).
Second, the study is focused on adults. The potential for life-changing decisions may be even greater when related procedures are applied to adolescents; less information is generally available to make a diagnosis in adolescents (Edens & Petrila, 2006). The methods we used are explicated in detail in this article so that others can apply them to their own (hopefully diverse) datasets.

REFERENCES

Alpert, M., & Raiffa, H. (1982). A progress report on the training of probability assessors. In D. Kahneman, P. Slovic, & A. Tversky (Eds.), Judgment under uncertainty: Heuristics and biases (pp. 294–305). New York: Cambridge University Press.
Alterman, A. I., Cacciola, J. S., & Rutherford, M. J. (1993). Reliability of the Revised Psychopathy Checklist in substance abuse patients. Psychological Assessment, 5, 442–448. doi:10.1037/1040-3590.5.4.442.
Altman, D. G., & Royston, P. (2000). What do we mean by validating a prognostic model? Statistics in Medicine, 19, 453–473. doi:10.1002/(SICI)1097-0258(20000229)19:4<453::AID-SIM350>3.0.CO;2-5.
American Educational Research Association/American Psychological Association. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Boccaccini, M. T., Turner, D. B., & Murrie, D. C. (2008). Do some evaluators report consistently higher or lower PCL-R scores than others? Findings from a statewide sample of sexually violent predator evaluations. Psychology, Public Policy, and Law, 14, 262–283. doi:10.1037/a0014523.
Borsboom, D., Mellenbergh, G. J., & van Heerden, J. (2003). The theoretical status of latent variables. Psychological Review, 110, 203–219. doi:10.1037/0033-295X.110.2.203.
Bradfield, R. B., Huntzickler, P. B., & Fruehan, G. J. (1970). Errors of group regression for prediction of individual energy expenditure. The American Journal of Clinical Nutrition, 23, 1015–1016.
Colditz, G. A. (2001).
Cancer culture; epidemics, human behavior, and the dubious search for new risk factors. American Journal of Public Health, 91, 357–359. doi:10.2105/AJPH.91.3.357.
Cooke, D. J. (2008). Psychopathy as an important forensic construct: Past, present and future. In D. Canter & R. Zukauskiene (Eds.), Psychology, crime & law. New horizons–International perspectives. Aldershot: Ashgate.
Cooke, D. J. (2009a). Psychopathy. In E. A. Campbell & J. Brown (Eds.), Cambridge handbook of forensic psychology. Cambridge: Cambridge University Press.
Cooke, D. J. (2009b). Strengths and limitations of the Psychopathy Checklist Revised (PCL-R) in courts and other tribunals (Paper under preparation).
Cooke, D. J., & Michie, C. (1997). An Item Response Theory evaluation of Hare’s Psychopathy Checklist. Psychological Assessment, 9, 2–13. doi:10.1037/1040-3590.9.1.3.
Cooke, D. J., Michie, C., & Hart, S. D. (2006). Facets of clinical psychopathy: Towards clearer measurement. In C. J. Patrick (Ed.), Handbook of psychopathy (pp. 91–106). New York: The Guilford Press.
Cooke, D. J., Michie, C., Hart, S. D., & Clark, D. (2005). Assessing psychopathy in the United Kingdom: Concerns about cross-cultural generalisability. The British Journal of Psychiatry, 186, 339–345. doi:10.1192/bjp.186.4.335.
Cooke, D. J., Michie, C., & Ryan, J. (2001). Evaluating risk for violence: A preliminary study of the HCR-20, PCL-R and VRAG in a Scottish prison sample. Edinburgh: Scotland Office.
Copas, J., & Marshall, P. (1998). The offender group reconviction scale: A statistical reconviction score for use by probation officers. Applied Statistics, 47, 159–171. doi:10.1111/1467-9876.00104.
DeMatteo, D., & Edens, J. F. (2006). The role and relevance of the Psychopathy Checklist-Revised in court. A case law survey of U.S. courts (1991–2004). Psychology, Public Policy, and Law, 12, 214–241. doi:10.1037/1076-8971.12.2.214.
Douglas, K. S., Vincent, G. M., & Edens, J. F. (2006).
Risk for criminal recidivism: The role of psychopathy. In C. J. Patrick (Ed.), Handbook of psychopathy (pp. 533–554). New York: The Guilford Press.
DSPD Programme. (2005). Dangerous and Severe Personality Disorder (DSPD) High Secure Services for Men. London: DSPD Programme, Department of Health, Home Office, HM Prison Service.
Edens, J. F., & Petrila, J. (2006). Legal and ethical issues in the assessment and treatment of psychopathy. In C. J. Patrick (Ed.), Handbook of psychopathy (pp. 573–588). New York: The Guilford Press.
Elmore, J. G., & Fletcher, S. W. (2006). The risk of cancer risk prediction: “What is my risk of getting breast cancer?” Journal of the National Cancer Institute, 98, 1673–1675.
Faigman, D. L. (2007). The limits of science in the courtroom. In E. Borgida & S. T. Fiske (Eds.), Beyond common sense: Psychological science in the courtroom (pp. 303–313). Oxford: Blackwell.
Fitch, W. L., & Ortega, R. J. (2000). Law and the confinement of psychopaths. Behavioral Sciences & the Law, 18, 663–678. doi:10.1002/1099-0798(200010)18:5<663::AID-BSL408>3.0.CO;2-V.
Fleiss, J. L. (1981). Statistical methods for rates and proportions. New York: Wiley.
Forouzan, E., & Cooke, D. J. (2005). Figuring out la femme fatale: Conceptual and assessment issues concerning psychopathy in females. Behavioral Sciences and the Law, 23, 765–778.
Gail, M. H., & Benichou, J. (2000). Encyclopedia of epidemiological methods. Chichester: Wiley.
Guilford, M. C., Rona, R. J., & Chinn, S. (1992). Trends in body mass index in young adults in England and Scotland from 1973 to 1988. Journal of Epidemiology and Community Health, 46, 187–190. doi:10.1136/jech.46.3.187.
Hájek, A., & Hall, N. (2002). Induction and probability. In P. Machamer & M. Silberstein (Eds.), Blackwell guide to the philosophy of science (pp. 149–172). Oxford: Blackwell.
Hanson, R. K., & Thornton, D. M. (1999). Static 99: Improving actuarial risk assessments for sex offenders.
Ottawa: Public Works and Government Services Canada.
Hare, R. D. (1991). The Hare Psychopathy Checklist—Revised (1st ed.). Toronto: Multi-Health Systems.
Hare, R. D. (1993). Without conscience: The disturbing world of the psychopaths among us (1st ed.). New York: Pocket Books.
Hare, R. D. (1998). The Hare PCL-R: Some issues concerning its use and misuse. Legal and Criminological Psychology, 3, 101–119.
Hare, R. D. (2003). The Hare Psychopathy Checklist—Revised (2nd ed.). Toronto: Multi-Health Systems.
Hart, S. D. (1998). The role of psychopathy in assessing risk for violence: Conceptual and methodological issues. Legal and Criminological Psychology, 3, 121–137.
Hart, S. D. (2001). Forensic issues. In W. J. Livesley (Ed.), Handbook of personality disorders: Theory, research, and treatment (pp. 555–569). New York: The Guilford Press.
Hart, S. D., Cox, D. N., & Hare, R. D. (1995). The Hare Psychopathy Checklist: Screening version (1st ed.). Toronto: Multi-Health Systems.
Hart, S. D., & Hare, R. D. (1997). Psychopathy: Assessment and association with criminal conduct. In D. M. Stoff, J. Breiling, & J. D. Maser (Eds.), Handbook of antisocial behavior (pp. 22–35). New York: Wiley.
Hart, S. D., Michie, C., & Cooke, D. J. (2007). The precision of actuarial risk assessment instruments: Evaluating the “Margins of Error” of group versus individual predictions of violence. The British Journal of Psychiatry, 190(Suppl 49), 60–65. doi:10.1192/bjp.190.5.s60.
Hawthorne, V. M., Murdoch, R. M., & Womersley, J. (1979). Body weight of men and women aged 40–64 years from an urban area in the West of Scotland. Community Medicine, 1, 229–235.
Hemphill, J. F., Hare, R. D., & Wong, S. (1998). Psychopathy and recidivism: A review. Legal and Criminological Psychology, 3, 139–170.
Hemphill, J. F., & Hart, S. D. (2002). Motivating the unmotivated: Psychopathy, treatment, and change. In M. McMurran (Ed.), Motivating offenders to change: A guide to enhancing engagement in therapy (pp. 193–220).
Chichester: Wiley.
Henderson, R., Jones, M., & Stare, J. (2001). Accuracy of point predictions in survival analysis. Statistics in Medicine, 20, 3083–3096. doi:10.1002/sim.913.
Henderson, R., & Keiding, N. (2005). Individual survival time prediction using statistical models. Journal of Medical Ethics, 31, 703–706. doi:10.1136/jme.2005.012427.
Kahneman, D., & Tversky, A. (1973). On the psychology of prediction. Psychological Review, 80, 237–251. doi:10.1037/h0034747.
Kennaway, R. (1998). Population statistics cannot be used for reliable individual prediction. Retrieved October 12, 2006, from http://citeseer.ist.psu.edu/328224.html.
Leistico, A. R., Salekin, R. T., DeCoster, J., & Rogers, R. (2008). A large-scale meta-analysis relating the Hare measures of psychopathy to antisocial conduct. Law and Human Behavior, 32, 28–45. doi:10.1007/s10979-007-9096-6.
Logan, C., & Johnstone, L. (2008). Personality disorders: Clinical and risk formulations (Paper under review).
Lyon, D., & Ogloff, J. R. P. (2000). Legal and ethical issues in psychopathy assessment. In C. B. Gacono (Ed.), The clinical and forensic assessment of psychopathy (pp. 139–173). Mahwah, NJ: Lawrence Erlbaum Associates.
Maden, A., & Tyrer, P. (2003). Dangerous and severe personality disorders: A new personality concept from the United Kingdom. Journal of Personality Disorders, 17, 489–496. doi:10.1521/pedi.17.6.489.25356.
MATHCAD.13. (2005). Mathcad 13 user’s guide. Cambridge, MA: Mathsoft Engineering and Education, Inc.
Michie, C., & Cooke, D. J. (2006). The structure of violent behavior: A hierarchical model. Criminal Justice and Behavior, 33, 706–737. doi:10.1177/0093854806288941.
Monahan, J., Steadman, H., Robbins, P. C., Appelbaum, P., Banks, S., Grisso, T., et al. (2005). An actuarial model of violence. Psychiatric Services, 56, 810–815. doi:10.1176/appi.ps.56.7.810.
Monahan, J., Steadman, H., Silver, E., Appelbaum, P., Robbins, P. C., Mulvey, E. P., et al. (2001).
Rethinking risk assessment: The MacArthur study of mental disorder and violence (1st ed.). New York: Oxford University Press.
Mooney, C. Z. (1997). Monte Carlo simulation. Thousand Oaks, CA: Sage.
Murrie, D. C., Boccaccini, M. T., Johnson, J. T., & Janke, C. (2008). Does interrater (dis)agreement on Psychopathy Checklist scores in Sexually Violent Predator trials suggest partisan allegiance in forensic evaluations? Law and Human Behavior, 32(4), 352–362. doi:10.1007/s10979-007-9097-5.
Murrie, D. C., Boccaccini, M. T., Turner, D., Meeks, M., Woods, C., & Tussey, C. Rater (dis)agreement on risk assessment measures in sexually violent predator proceedings: Evidence of adversarial allegiance in forensic evaluation. Psychology, Public Policy, and Law, in press.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.
Psychological Corporation. (1999). Wechsler Abbreviated Scale of Intelligence (WASI) manual. San Antonio, TX: Psychological Corporation.
Quinsey, V. L., Harris, G. T., Rice, M. E., & Cormier, C. A. (1998). Violent offenders: Appraising and managing risk (1st ed.). Washington, DC: American Psychological Association.
Richters, J. E. (1997). The Hubble hypothesis and the developmentalist’s dilemma. Development and Psychopathology, 9, 193–229. doi:10.1017/S0954579497002022.
Robert, C. P. (2004). Monte Carlo statistical methods. New York: Springer-Verlag.
Rockhill, B. (2001). The privatization of risk. American Journal of Public Health, 91, 365–368. doi:10.2105/AJPH.91.3.365.
Rockhill, B., Kawachi, I., & Colditz, G. A. (2000). Individual risk prediction and population-wide disease prevention. Epidemiologic Reviews, 22, 176–180.
Rorer, L. (1990). Personality assessment: A conceptual survey. In L. A. Pervin (Ed.), Handbook of personality: Theory and research (pp. 693–720). New York: The Guilford Press.
Rose, G. (1992). The strategy of preventative medicine. Oxford: Oxford Medical Publications.
Salekin, R. T., Rogers, R., & Sewell, K. W. (1996). A review and meta-analysis of the Psychopathy Checklist and Psychopathy Checklist-Revised: Predictive validity of dangerousness. Clinical Psychology: Science and Practice, 3, 203–215.
Scott, K. G. (2003). Commentary: Individual risk prediction, individual risk, and population risk. Journal of Clinical Child and Adolescent Psychology, 32, 243–245. doi:10.1207/S15374424JCCP3202_9.
Steel, R. G. D., Torrie, J. H., & Dickey, D. A. (1997). Principles and procedures of statistics: A biometrical approach. New York: McGraw Hill.
Tam, C. C., & Lopman, B. A. (2003). Determinism versus stochasticism: In support of long coffee breaks. Journal of Epidemiology and Community Health, 57, 478. doi:10.1136/jech.57.7.477.
Verona, E., & Vitale, J. (2006). Psychopathy in women: Assessment, manifestations and etiology. In C. J. Patrick (Ed.), Handbook of psychopathy (pp. 415–436). New York: The Guilford Press.
Wald, N. J., Hackshaw, A. K., & Frost, C. D. (1999). When can a risk factor be used as a worthwhile screening test? British Medical Journal, 319, 1562–1565.
Walsh, T., & Walsh, Z. (2006). The evidentiary introduction of the Psychopathy Checklist-Revised assessed psychopathy in U.S. courts: Extent and appropriateness. Law and Human Behavior, 30, 493–507. doi:10.1007/s10979-006-9042-z.
Walters, G. D. (2003). Predicting criminal justice outcomes with the Psychopathy Checklist and Lifestyle Criminality Screening Form: A meta-analytic comparison. Behavioral Sciences & the Law, 21, 89–102. doi:10.1002/bsl.519.
Wechsler, D. (1991). Wechsler Intelligence Scale for Children (3rd ed.). San Antonio, TX: The Psychological Corporation.
Zinger, I., & Forth, A. E. (1998). Psychopathy and Canadian criminal proceedings: The potential for human rights abuses. Canadian Journal of Criminology, 40, 237–276.
Law Hum Behav (2010) 34:259–274
Reproduced with permission of the copyright owner.
Further reproduction prohibited without permission.
