PSYCHOLOGY PAPER

There is a movement in psychology to require the usage of psychological measures to aid in the diagnosis of some mental illnesses. While developing the DSM-V, there was some discussion of including results of specific measures as part of diagnostic criteria. Considering the two articles you read, Limitations of Diagnostic Precision and Predictive Utility in the Individual Case: A Challenge for Forensic Practice and Diagnostic Utility of the NAB List Learning Test in Alzheimer's Disease and Amnestic Mild Cognitive Impairment, write a response paper discussing the utility of using psychological measures as suggested above in the diagnosis of mental illness.

ORIGINAL ARTICLE

Limitations of Diagnostic Precision and Predictive Utility

in the Individual Case: A Challenge for Forensic Practice

David J. Cooke · Christine Michie

Received: 24 August 2007 / Accepted: 2 February 2009 / Published online: 11 March 2009

American Psychology-Law Society/Division 41 of the American Psychological Association 2009

Abstract Knowledge of group tendencies may not assist

accurate predictions in the individual case. This has

importance for forensic decision making and for the

assessment tools routinely applied in forensic evaluations.

In this article, we applied Monte Carlo methods to examine

diagnostic agreement with different levels of inter-rater

agreement given the distributional characteristics of PCL-R

scores. Diagnostic agreement and score agreement were

substantially less than expected. In addition, we examined

the confidence intervals associated with individual predictions

of violent recidivism. On the basis of empirical

findings, statistical theory, and logic, we conclude that

predictions of future offending cannot be achieved in the

individual case with any degree of confidence. We discuss

the problems identified in relation to the PCL-R in terms of

the broader relevance to all instruments used in forensic

decision making.

There is an important disjunction between the perspective

of science and the perspective of the law; while science

seeks universal principles that apply across cases, the law

seeks to apply universal principles to the individual case.

Bridging these perspectives is a major challenge for psychology

(Faigman, 2007). It is recognized by statisticians

that knowledge of group tendencies—even when precise—

may not assist accurate evaluation of the individual case

(e.g., Colditz, 2001; Henderson & Keiding, 2005; Rockhill,

2001; Tam & Lopman, 2003). It is a statistical truism that

the mean of a distribution tells us about everyone, yet no

one. This has serious implications for the use of psychological

tests in forensic decision making. To illustrate these

limitations, we focus on one of the most widely used, and

perhaps the most extensively validated, test in the forensic

arena—the Psychopathy Checklist Revised (PCL-R1; Hare,

2003). We emphasize, however, that all psychological tests

used in the same way in the forensic arena will suffer from

similar limitations (e.g., VRAG, Quinsey, Harris, Rice, &

Cormier, 1998; Static-99, Hanson & Thornton, 1999;

COVR, Monahan et al., 2005).

Mental health professionals are frequently asked to

opine whether an individual might be violent in the future;

psychopathic personality disorder is an important risk

factor to consider (Hart, 1998). The PCL-R is the most

frequently used measure of psychopathic personality disorder;

it has been described as the ‘‘gold standard’’ for that

purpose (Edens, Skeem, Cruise, & Cauffman, 2001; as

cited in Hare, 2003). There can be little doubt that the PCL-R

has made a major contribution to our understanding of

violence (Hart, 1998); nonetheless, it is important for the

field to consider both its strengths and its limitations.

Findings for this instrument will have implications for less

well-validated tools. In this introduction, we consider two

issues; first, the use of PCL-R scores in forensic practice

and second, the general problem of the precision of predictions

about an individual case.

D. J. Cooke (corresponding author) · C. Michie

Department of Psychology, Glasgow Caledonian University,

Glasgow G4 0BA, UK

1 The PCL-R is a 20-item rating scale of traits and behaviors intended

for use in a range of forensic settings. Definitions of each item are

provided and evaluators rate the lifetime presence of each item on a

3-point scale (0 = absent, 1 = possibly or partially present, and

2 = definitely present) on the basis of an interview with the

participant and a review of case history information.

Law Hum Behav (2010) 34:259–274

DOI 10.1007/s10979-009-9176-x

PCL-R SCORES AND FORENSIC PRACTICE

Much of the interest in the construct psychopathy comes

from the relationship between the PCL-R and future

criminal behavior (Lyon & Ogloff, 2000). Previous

research suggests that psychopathy—as assessed using the

Psychopathy Checklist-Revised (PCL-R; Hare, 1991)—is

an important risk marker for criminal and violent behavior

(Douglas, Vincent, & Edens, 2006; Hart, 1998; Hart &

Hare, 1997; Hemphill, Hare, & Wong, 1998; Leistico,

Salekin, DeCoster, & Rogers, 2008; Salekin, Rogers, &

Sewell, 1996). In fact, the PCL-R has been lauded as an

‘‘unparalleled’’ single predictor of violence (Salekin et al.,

1996). Hart (1998) argued that failure to consider psychopathy

in a violence risk assessment may constitute

professional negligence. This empirical base has resulted in

the PCL-R being used, not merely to measure the trait

strength of psychopathy in an individual, but also to make

predictions about what he or she will do in the future (Hare,

1993). As we demonstrate formally below, this additional

step of prediction means that the potential for imprecision

in forensic evidence is greatly increased: It expands the

gulf between inferences about groups and inferences about

individuals.

The PCL-R has been incorporated into statutory or legal

decision making (Hare, 2003). Within England and Wales,

a PCL-R score above a cut-off of 25 or 30 can lead to

detention in either a Special Hospital or a prison (Maden &

Tyrer, 2003); in certain Canadian provinces parole boards

explicitly consider PCL-R scores (Hare, 2003), and in

Texas psychopathy assessments are mandated by statute for

sexual predator evaluation (Edens & Petrila, 2006).2 The

PCL-R plays a role in criminal sentencing, including

decisions regarding indefinite commitment and capital

punishment, institutional placement and treatment, conditional

release, juvenile transfer, child custody, witness

credibility, civil torts, and indeterminate civil commitment

(DeMatteo & Edens, 2006; Fitch & Ortega, 2000; Hart,

2001; Hemphill & Hart, 2002; Lyon & Ogloff, 2000;

Walsh & Walsh, 2006; Zinger & Forth, 1998). The PCL-R

is regarded by many as the best method for operationalizing

the construct of psychopathy. For example, Lyon and

Ogloff (2000) argued that ‘‘…it is critical that the assessment

is made using the PCL-R’’ (p. 166) when evidence

about violence risk, based on psychopathy, is provided.

Because of its central role in forensic decision making it is

vital to assess its strengths and limitations and, by comparison,

the limitations of less well-validated procedures.

PREDICTIONS FOR INDIVIDUALS VERSUS

PREDICTIONS FOR GROUPS

Prediction is the raison d'être of many forensic instruments

(e.g., VRAG, Quinsey et al., 1998; Static-99, Hanson &

Thornton, 1999; COVR, Monahan et al., 2005). While this is

not true of the PCL-R its frequent use in forensic practice is

underpinned by the assumption—implicit or explicit—that it

can predict future offending (Walsh & Walsh, 2006). How

precise can such predictions be? The precision of any estimate

of a parameter (e.g., mean rate of recidivism of a group)

can be measured by the width of a confidence interval (CI); a

CI gives an estimated range of values, which is likely to

include an unknown population parameter. If independent

samples are taken repeatedly from the same population, and a

CI calculated for each sample, then a certain percentage

(confidence level) of the intervals will include the unknown

population parameter. Typically, 95% of these intervals

should include the unknown population parameter; other

intervals may be used (e.g., 68% and 99%). The width of this

interval provides a measure of the precision—or certainty—

that we can have in the estimate of the population parameter.

The width of a CI of a population parameter is linked, in part,

to the sample size used to estimate the population parameter

(see below for a more technical explanation).
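
The repeated-sampling reading of a CI described above can be checked with a small simulation. The sketch below is illustrative only: the population mean of 20, the SD of 8, and the sample sizes are arbitrary assumptions, and the interval uses the normal approximation (z = 1.96) rather than any method from the article.

```python
import math
import random
import statistics

def ci_for_mean(sample, z=1.96):
    """Normal-approximation 95% CI for the mean of one sample."""
    m = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return m - z * se, m + z * se

def coverage(true_mean, sd, n, trials, seed=0):
    """Share of repeated-sample CIs containing the true mean, and mean CI width."""
    rng = random.Random(seed)
    hits = 0
    widths = []
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        lo, hi = ci_for_mean(sample)
        hits += lo <= true_mean <= hi
        widths.append(hi - lo)
    return hits / trials, statistics.fmean(widths)

# Hypothetical population (mean 20, SD 8); only the sample size differs.
cov_small, width_small = coverage(20.0, 8.0, n=50, trials=2000)
cov_large, width_large = coverage(20.0, 8.0, n=400, trials=2000)
print(cov_small, width_small)
print(cov_large, width_large)
```

Under these assumptions roughly 95% of the intervals should contain the true mean at either sample size, while the larger samples should give much narrower intervals.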

The prevailing prediction paradigm has two stages.

First, the parameters (mean, slope, and variance) of a

regression model linking an independent variable (e.g.,

PCL-R score) to a dependent variable (e.g., likelihood of

reconviction) are estimated. Each of these parameters has

uncertainty associated with it, which can be expressed

by confidence bands about the regression line. Second, a

new case is selected and the PCL-R score is assessed, the

model is applied and the likelihood of reconviction is

estimated. The best estimate of the likelihood of reconviction

for a new case will be identical to the point on the

regression line for that PCL-R score. This new estimate has

a CI—also known as a prediction interval—that expresses

the precision, or certainty, that should be associated with

the prediction made about the new case. Often the two

steps are conflated, with the unrecognized assumption

being made that the prediction interval for the new case is

comparable to the CIs for the model. It is not (see below).
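
The gap between the two intervals can be illustrated with ordinary least-squares regression. This is a sketch under stated assumptions, not the model used in the article: the data are synthetic, the outcome is treated as continuous, the slope of 0.5 and residual SD of 5 are invented, and z = 1.96 stands in for the exact t quantile.

```python
import math
import random

def fit_line(xs, ys):
    """Least-squares fit; returns intercept, slope, residual SD, mean of x, Sxx."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    resid_ss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(resid_ss / (n - 2))
    return intercept, slope, s, mean_x, sxx

def half_widths(x0, n, s, mean_x, sxx, z=1.96):
    """Approximate 95% half-widths at x0: (regression line, new individual case)."""
    leverage = 1.0 / n + (x0 - mean_x) ** 2 / sxx
    ci_half = z * s * math.sqrt(leverage)        # uncertainty in the fitted line
    pi_half = z * s * math.sqrt(1.0 + leverage)  # adds the individual's own scatter
    return ci_half, pi_half

rng = random.Random(0)
xs = [rng.uniform(0, 40) for _ in range(500)]   # hypothetical test scores
ys = [0.5 * x + rng.gauss(0, 5) for x in xs]    # hypothetical continuous outcome
intercept, slope, s, mean_x, sxx = fit_line(xs, ys)
ci_half, pi_half = half_widths(30, len(xs), s, mean_x, sxx)
print(ci_half, pi_half)
```

The confidence band for the line shrinks as the sample grows, but the prediction interval can never be narrower than about 1.96 residual SDs, however large the sample used to fit the model.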

The problem of making predictions for individuals from

statistical models is now recognized in other disciplines. In

relation to medical risks, Rose (1992) expressed the position

clearly: ‘‘Unfortunately the ability to estimate the

average risk for a group, which may be good, is not matched

by any corresponding ability to predict which

individuals are going to fall ill soon’’ (p. 48). In relation to

reoffending, Copas and Marshall (1998) made a related

point ‘‘…the score is not a prediction about an individual

[italics added], but an estimate of what rate of conviction

2 The PCL-R is the most commonly used instrument for assessing

psychopathy in this setting (Mary Alice Conroy, Personal communication,

10 April 2007).

might be expected of a group [italics added] of offenders

who match that individual on the set of covariates used by

the score’’ (p. 170) (see also Altman & Royston, 2000;

Bradfield, Huntzickler, & Fruehan, 1970; Colditz, 2001;

Elmore & Fletcher, 2006; Henderson, Jones, & Stare, 2001;

Henderson & Keiding, 2005; Rockhill, 2001; Rockhill,

Kawachi, & Colditz, 2000; Tam & Lopman, 2003; Wald,

Hackshaw, & Frost, 1999).

It is not generally recognized that a risk factor must have

a very strong relative risk (i.e., >50) if it is to have utility as

a screening instrument at the individual level (Rockhill

et al., 2000; see also Kennaway, 1998). However, others set

the bar higher:

A risk factor has to be extremely strongly associated

with a disease within a population before it can be

considered to be a potentially useful screening test.

Even a relative odds of 200 between the highest and

lowest fifths will yield a detection rate of no more

than about 56% for a 5% false positive rate… (Wald

et al., 1999, p. 1564).

To put this in perspective, the relative risk for the

association between lung cancer and smoking is between

10 and 15 (Rockhill et al., 2000), depending on the definition

of exposure. The relative risk for the PCL-R and

recidivism is something of the order of 3 for general

recidivism and 4 for violent recidivism (Hare, 2003).

Does the application of current forensic tools provide an

adequate basis for testimony concerning the individual case?

In this article, we attempt to answer this question by considering

three issues pertaining to PCL-R data. How

confident can clinicians and legal decision makers be, first, in

the use of critical diagnostic cut-offs; second, in the

numerical value of PCL-R scores; and third, in individual

predictions of violent recidivism? We describe two studies.

The first study addresses the accuracy of diagnostic decisions

and the potential range of discrepancies between two raters.

The second study addresses the accuracy of prediction of

future violence in the individual case. The results have relevance

beyond the PCL-R to the use of other psychometric

instruments in forensic practice: The same limitations may

apply to many forensic assessment instruments.

STUDY ONE

In the first study, we examined diagnostic accuracy, specifically

the allocation of individuals around two critical

cut-offs, i.e., around 30 and around 25; the first is the

standard PCL-R cut-off for the diagnosis of psychopathy

and the second, often adopted in the UK, has proven useful

in that context including in decisions regarding treatment

allocation (Hare, 2003).

The inter-rater reliability figures presented in the PCL-R

manual can be regarded as good (Nunnally & Bernstein,

1994); intraclass correlation coefficients for single ratings

(ICC1) are estimated in some research studies as being

above .80 (Male offenders = .86; Male forensic psychiatric

patients = .88; Hare, 2003, Table 5.4).3 Edens and

Petrila (2006) indicated that these are probably ‘‘best case’’

estimates and ‘‘real world’’ reliabilities may be substantially

poorer.4 Murrie, Boccaccini, Johnson, and Janke

(2008), in one ‘‘real world’’ study, demonstrated poor

agreement (ICC1 = .39). These views and findings echo

concerns expressed by Hare (1998), that while researchers

take great pains to ensure reliability in their studies, the

level of reliability achieved by individual clinicians

remains unknown—and by implication—is likely to be

poorer than published studies. Inter-rater reliability is not

the only relevant consideration: Diagnostic precision is

also influenced by the underlying distribution of test scores.

Diagnostic precision is influenced by the location of the

cut-off and the shape of the distribution of scores—both

skewness and kurtosis. Estimates of the precision of a test

score (e.g., standard errors of measurement, SEM) are

weighted toward the mean of the distribution whereas cutoffs

are generally located substantially above the mean.

Item Response Theory (IRT) studies demonstrate that the

measurement precision of the PCL-R—in terms of measurement

information—falls toward the diagnostic cut-off

(Cooke, Michie, & Hart, 2006); thus, the SEM estimated

on the mean will provide an optimistic estimate of diagnostic

precision.

The SEM cannot be directly translated into estimates of

precision of diagnosis because of the impact of the score

distributions. Equally, it is not possible to estimate misclassification

rates directly using ICC1 values; therefore,

simulation approaches are required. Study one describes a

simulation that examines the impact of unreliability on

diagnostic accuracy.

Method

Monte Carlo studies allow the investigation of the properties

of distributions and estimates of parameters where

results cannot be derived theoretically (Mooney, 1997;

Robert, 2004). Large numbers of simulated datasets can be

3 The estimates of reliability are frequently obtained by re-rating the

same interview or with an observer simultaneously rating within an

interview. This will tend to inflate reliability, but not validity, as the

same information source is being used.

4 The case of THE PEOPLE, Plaintiff and Respondent, v. KURT

ADRIAN PARKER, a Sexual Violent Predator ACT case highlights

the variability that can emerge in some cases; five accredited experts

furnished five PCL-R scores that ranged from 10 to 25. (Edens, John,

Personal Communication, 22 May 2006).

created based on an explicit and replicable data-generation

process. The effect of known features designed into the

data, such as levels of inter-rater reliability, on outcomes,

such as diagnostic precision, can be assessed. Multiple

trials of procedures are carried out to allow precise estimation

of outcomes. Mooney (1997) argued that Monte

Carlo simulations could allow social scientists to test

classical parametric inference methods and provide more

accurate statistical models. In our view, this mainstream

statistical technique is underused in forensic research.

Materials

We used Monte Carlo techniques based on distribution

information from two datasets of PCL-R total scores: (1)

data for North American Male Offenders (Table 9.1, Hare,

2003) and (2) data from UK prisoners (Cooke, Michie,

Hart, & Clark, 2005).5

The first distribution, being the largest, probably provides

the best estimate of the true distribution of scores

underlying the PCL-R and is described as ‘‘approximately

normal’’ (Hare, 2003, p. 55). Given the potential impact of

a departure from normality we tested whether this distribution

was in fact normal. The departure from normality

was highly significant (Kolmogorov–Smirnov = .068,

df = 5408, p < .0001, Skewness = -.33, Kurtosis =

-.570). Examination of Fig. 1 demonstrates that around the

standard cut-off of 30 cases are over-represented while in

the right tail of the distribution they are under-represented.

In the simulation study, we generated two random variables

per case using MATHCAD.13 (2005). These random variables were

scaled according to one of the two datasets referred

to above with mean μ and standard deviation σ.6 This gives

two uncorrelated ratings (x1 and x2) from the distribution of

scores: x1 is our first rating on the subject, PCL1. We then

calculated a linear combination of the two ratings to provide a

second rating on the same subject, which has a correlation of

ρ with the first rating. The linear combination is

\[ \mathrm{PCL}_2 = \mathrm{round}\left\{ \mu + (x_1 - \mu)\rho + (x_2 - \mu)\sqrt{1 - \rho^2} \right\} \]

using rounding

to ensure an integer score. This process gives two random,

correlated scores from the distribution. There is a very small

probability of obtaining second ratings less than 0 or greater

than 40: These scores have been taken as 0 or 40, respectively.
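
The data-generation step described above can be sketched in a few lines. This is a hedged approximation, not the authors' Mathcad code: it draws the uncorrelated ratings from a normal distribution, whereas the article scaled them to empirical score distributions, and the mean of 22 and SD of 8 are hypothetical placeholders. Both ratings are clamped to the 0 to 40 range because a normal draw can fall outside it.

```python
import math
import random

def simulate_pair(rng, mu, sigma, rho):
    """Return two integer ratings whose underlying scores correlate at rho."""
    x1 = rng.gauss(mu, sigma)
    x2 = rng.gauss(mu, sigma)
    pcl1 = round(x1)
    # Linear combination from the text: variance stays sigma^2, correlation is rho.
    pcl2 = round(mu + (x1 - mu) * rho + (x2 - mu) * math.sqrt(1 - rho ** 2))
    return max(0, min(40, pcl1)), max(0, min(40, pcl2))

def below_cutoff_rate(mu, sigma, rho, trials, cutoff=30, band=(30, 34), seed=1):
    """Share of cases rated inside `band` by rater 1 that rater 2 puts below `cutoff`."""
    rng = random.Random(seed)
    in_band = 0
    below = 0
    for _ in range(trials):
        p1, p2 = simulate_pair(rng, mu, sigma, rho)
        if band[0] <= p1 <= band[1]:
            in_band += 1
            below += p2 < cutoff
    return below / in_band

# mu and sigma are assumed values, not the published distribution parameters.
rate_080 = below_cutoff_rate(22.0, 8.0, 0.80, trials=200_000)
rate_090 = below_cutoff_rate(22.0, 8.0, 0.90, trials=200_000)
print(rate_080, rate_090)
```

Because the normal assumption ignores the skewness and kurtosis discussed earlier, the rates printed here should not be expected to match the tabulated results; the sketch only shows the mechanism by which lower reliability moves more second ratings across the cut-off.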

Assuming that the ICC1 represents the best estimate of

the correlation between the two scores, we estimated the

distributions for four values of reliability, i.e., ICC1 values

of .75, .80, .85, and .90. The .80 value is a lower bound

estimate for reasonable practice. Hare (1998) indicated that

at least this level should be achievable with ‘‘…properly

conducted assessments’’ (p. 107). The .85 level may be

achievable by one rater with good training; the .90 level,

perhaps the best case scenario, is the level achievable where

two independent sets of ratings are averaged. Values above

.90 are rarely if ever achievable—Hare (1998) describes .95

and higher as ‘‘unbelievably high’’ (p. 107). The .75 provides

a lower-bound estimate of what may be obtained in

clinical practice. These values probably represent optimistic

estimates for actual clinical practice; we did not assess the

‘‘worst case’’ scenarios implied by Edens and Petrila (2006)

and by Murrie et al. (2008). The estimation procedure was

repeated 10,000,000 times for each of the four levels of

ICC1 to provide stable estimates of the distribution of the

correlated ratings and to ensure at least 10,000 cases within

each of the extreme score bands. We examine discrepancies

in two ways: First, in terms of diagnostic disagreement and

second, in terms of disagreements about total scores.

What is the level of diagnostic agreement? Kappa (κ)

coefficients measure the proportion of diagnostic agreements

corrected for observed base rates (Fleiss, 1981).

Conventionally, κ > .75 represents excellent agreement,

.40 < κ < .75 represents fair to good agreement and

κ < .40 represents poor agreement (Gail & Benichou,

2000). Kappa values for two distributions and the four

ICC1 values are given in Table 1. We calculated Kappa

coefficients for agreement in diagnosis between the two

ratings using both common cut-offs, i.e., 30 and 25. The

vast majority of Kappa values are only in the fair to good

range; few values approach the poor range.
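
As a worked check, kappa can be recomputed from the agreement percentages reported in Table 1. The sketch below assumes, because the table does not say, that disagreements split evenly between the two raters so that both share the same base rate; the function name is illustrative.

```python
def kappa_from_agreement(both_below, both_above, different):
    """Cohen's kappa for a two-category diagnosis from percentages of cases.

    Assumption: the `different` cases divide equally between
    "rater 1 above, rater 2 below" and the reverse.
    """
    total = both_below + both_above + different
    a = both_below / total
    d = both_above / total
    off = different / total
    p_above = d + off / 2          # shared base rate of scoring above the cut-off
    p_below = a + off / 2
    observed = a + d               # raw proportion of agreement
    expected = p_above ** 2 + p_below ** 2   # agreement expected by chance
    return (observed - expected) / (1 - expected)

# Rows from Table 1, cut-off 30, correlation 0.80.
na_080 = kappa_from_agreement(73.3, 11.5, 15.1)   # North American male offenders
uk_080 = kappa_from_agreement(92.0, 2.4, 5.6)     # United Kingdom prisoners
print(round(na_080, 2), round(uk_080, 2))
```

The UK row shows why raw agreement can mislead: 94.4% of cases receive the same classification, yet because so few prisoners score above 30, most of that agreement is expected by chance and kappa is only modest.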

Kappa is an omnibus statistic, which is useful for

summarizing group results; however, it tells us little about

agreement in the individual case. The potential for misclassification

is clearer when distributions of disagreements

Fig. 1 Distribution of North American male prisoners and normal curve (x-axis: PCL-R score, 0 to 40; y-axis: frequency, 0 to 300)

5 We carried out a similar analysis of data for Male Forensic

Psychiatric Patients (Table 9.2, Hare, 2003); the results, which

demonstrate the same pattern, can be obtained from the first author.

6 A full description of the simulation study including the Mathcad

code can be obtained from the first author.

are considered. The distributions based on the North

American Male Offenders are in Table 2. For ease of

interpretation, we tabulated the distributions in 5-point

ranges. Examination of the sub-table for ICC1 = .80

indicates that if one rater gives a score between 30 and 34,

i.e., just above the diagnostic cut-off then only in 46% of

occasions—approximately half the time—will the other

rater obtain a score within the same range. In 44% of the

occasions, the second rater would place the individual

below the critical cut-off. Even in the best case scenario,

i.e., ICC1 = .90, if one rater gives a score between 30 and

34 then only in 60% of occasions will the other rater obtain

a score within the same range. On 29% of occasions, the

second rater would place the participant below the critical

cut-off.

The distributions based on the UK prisoners are in

Table 3. Examination of the table for ICC1 = .80 indicates

that if one rater gives a score between 30 and 34, i.e., just

above the diagnostic cut-off then only in 39% of occasions

will the second rater obtain a score within the same range. In

54% of the cases, the second rater would place the individual

below the critical cut-off. As previously, even in the best case

scenario, i.e., ICC1 = .90, if one rater gives a score between

30 and 34 then only in 53% of cases will the other rater obtain

a score within the same range. In 39% of cases, the second

rater would place the participant below the critical cut-off.

In the UK, the cut-off of 25, as well as 30, is often applied

(DSPD Programme, 2005; Hare, 2003). Examination of the

table for ICC1 = .80 indicates that if one rater gives a score

between 25 and 29, i.e., just above the UK diagnostic cut-off,

then only in 29% of occasions will the other rater obtain a

score within the same range. On 49% of occasions, the second

rater would place the individual below the critical cut-off.

Even in the best case scenario, i.e., ICC1 = .90, if one

rater gives a score between 25 and 29 then only in 37% of

cases will the other rater obtain a score within the same

range. On 37% of occasions, the second rater would place the

participant below the critical cut-off.

Therefore, in broad terms, all of the findings reported

above demonstrate that the allocation of an individual

above or below diagnostic cut-offs is much less precise

than previously thought.

Another way of considering the precision of PCL-R

scores is to examine expected discrepancies in scores based

on variations in ICC1 while taking into account the distributional

characteristics of the PCL-R scores. The PCL-R

manual suggests that in 68% of cases the discrepancies

between two raters should be up to 3 points, and in 95% of

cases it should be up to 6 points (Hare, 2003). This assumes

normality of the PCL-R score distribution, an assumption

that is not met (see above). The cumulative distribution of

score discrepancies estimated from the Monte Carlo studies

is tabulated in Table 4. With the North American prisoner

sample and an ICC1 of .80, a discrepancy of between 8 and

9 points would be expected in 9% of cases, around 10

points in 5% of cases, and between 12 and 13 points in 1%

of cases. With the UK prisoner sample, and an ICC1 of .80,

a discrepancy of between 8 and 9 points would be expected

in 10% of cases, around 10 points in 5% of cases, and

around 12 points in 1% of cases.
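
The normal-theory expectation can be computed directly and set beside the simulated values. The sketch below assumes, as the Table 4 footnote states, that discrepancies are normally distributed with an SD of 3; the three simulated proportions are copied from Table 4 (North American sample, correlation 0.80), and the function name is illustrative.

```python
from statistics import NormalDist

def normal_tail(points, sd=3.0):
    """P(|discrepancy| >= points) for a zero-mean normal with the given SD."""
    return 2.0 * (1.0 - NormalDist(0.0, sd).cdf(points))

# Simulated proportions from Table 4 (North American sample, correlation 0.80).
simulated_080 = {3: 0.647, 6: 0.292, 10: 0.050}

# Map each discrepancy to (normal-theory proportion, simulated proportion).
comparison = {k: (round(normal_tail(k), 3), v) for k, v in simulated_080.items()}
for points, (expected, simulated) in comparison.items():
    print(points, expected, simulated)
```

Under the normal assumption about 68% of discrepancies fall within 3 points and about 95% within 6, which is the manual's claim; the simulated proportions exceeding those thresholds are several times larger.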

An alternative approach to summarize the range of

possible discrepancies is to estimate the distribution of a

2nd PCL-R rating given the 1st PCL-R rating. This conditional

distribution can be summarized by a CI that

contains 95% of the 2nd ratings. This interval is thus

defined by the lower and upper limits LL and UL given by

\[ \mathrm{Prob}(\mathrm{LL} < \text{2nd rating} < \mathrm{UL} \mid \text{1st rating}) = 0.95 \]

Results for both 68% and 95% CIs for ICC1 = .80, and

for both samples, are presented in Table 5. For example, in

the North American prisoner sample, if rater one obtains a

total score of 30, then the 95% CI for rater two’s total score

will be between 19 and 36 (i.e., between the 35th and 99th

percentile).

All the estimates in this study are conservative; that is,

they assume that the SEM that applies at the mean applies

Table 1 Kappa coefficients and levels of agreement for four levels of correlation (ρ) for two distributions

ρ      Both <30   Both ≥30   Different   κ      Both <25   Both ≥25   Different   κ

North American male offenders
0.75   72.9       10.6       16.4        .46    48.0       29.6       22.4        .54
0.80   73.3       11.5       15.1        .51    48.9       31.0       20.0        .59
0.85   74.4       12.2       13.5        .56    50.5       32.3       17.2        .64
0.90   75.5       13.4       11.1        .64    51.7       34.2       14.1        .71

United Kingdom prisoners
0.75   91.8       2.1        6.1         .38    79.8       7.6        12.6        .47
0.80   92.0       2.4        5.6         .43    83.1       6.8        10.2        .52
0.85   92.4       2.7        5.0         .49    81.2       9.1        9.7         .60
0.90   92.7       3.1        4.2         .57    82.0       10.1       7.9         .67

around the cut-off. However, this is an unwarranted

assumption. The overall variance of errors of measurement is

a weighted average of the errors that pertain across the range

of true score values. Precision of measurement of the PCL-R

drops as scores approach the diagnostic cut-off (e.g., Cooke

& Michie, 1997; Cooke et al., 2006). Thus, the degree of

diagnostic misclassification and score discrepancy is likely

to be greater in practice than demonstrated in the simulation

above. The conditional SEM (CSEM)7 is the square root of

the variance of errors at a particular level of true scores. To

Table 2 Distribution of diagnostic disagreements by four levels of

correlation between raters based on distribution of North American

male offenders

PCL-R score

0–4 5–9 10–14 15–19 20–24 25–29 30–34 35–40

ρ = 0.75

0–4 .209 .086 .033 .006 0 0 0 0

5–9 .505 .305 .200 .079 .008 0 0 0

10–14 .245 .318 .250 .188 .072 .005 0 0

15–19 .041 .235 .266 .270 .239 .078 .002 0

20–24 0 .053 .192 .256 .310 .281 .097 0

25–29 0 .002 .056 .155 .236 .369 .371 .130

30–34 0 0 .003 .044 .121 .230 .445 .551

35–40 0 0 0 .001 .014 .037 .085 .319

ρ = 0.80

0–4 .245 .096 .032 .002 0 0 0 0

5–9 .523 .372 .211 .065 .003 0 0 0

10–14 .217 .315 .288 .198 .055 .001 0 0

15–19 .014 .198 .280 .306 .240 .055 0 0

20–24 0 .019 .162 .269 .340 .283 .063 0

25–29 0 0 .026 .137 .247 .391 .379 .076

30–34 0 0 0 .023 .106 .237 .465 .585

35–40 0 0 0 0 .010 .031 .093 .339

ρ = 0.85

0–4 .285 .105 .021 0 0 0 0 0

5–9 .552 .423 .217 .038 0 0 0 0

10–14 .158 .328 .333 .202 .028 0 0 0

15–19 .005 .141 .303 .354 .231 .024 0 0

20–24 0 .003 .121 .287 .386 .265 .029 0

25–29 0 0 .005 .111 .266 .430 .351 .038

30–34 0 0 0 .008 .085 .252 .519 .569

35–40 0 0 0 0 .003 .029 .101 .393

ρ = 0.90

0–4 .361 .103 .009 0 0 0 0 0

5–9 .578 .486 .215 .012 0 0 0 0

10–14 .061 .348 .390 .190 .010 0 0 0

15–19 0 .063 .319 .422 .216 .005 0 0

20–24 0 0 .067 .304 .449 .238 .004 0

25–29 0 0 0 .071 .267 .500 .289 .006

30–34 0 0 0 0 .056 .239 .609 .494

35–40 0 0 0 0 0 .018 .098 .501

The tables show column percentages, which sum to 1 within rounding

error. The rows therefore do not sum to 1

Table 3 Distribution of diagnostic disagreements by four levels of correlation

between raters based on distribution of UK prisoners

PCL-R score

0–4 5–9 10–14 15–19 20–24 25–29 30–34 35–40

ρ = 0.75

0–4 .395 .158 .064 .020 .002 0 0 0

5–9 .477 .352 .228 .107 .035 .002 0 0

10–14 .128 .332 .309 .219 .131 .032 0 0

15–19 0 .153 .271 .336 .265 .198 .041 0

20–24 0 .005 .119 .215 .316 .297 .266 .031

25–29 0 0 .009 .086 .162 .261 .291 .254

30–34 0 0 0 .017 .083 .191 .328 .529

35–40 0 0 0 0 .006 .020 .075 .186

ρ = 0.80

0–4 .438 .163 .059 .011 0 0 0 0

5–9 .485 .387 .232 .097 .014 0 0 0

10–14 .077 .354 .328 .229 .111 .016 0 0

15–19 0 .095 .296 .354 .289 .154 .023 0

20–24 0 0 .083 .237 .331 .321 .214 .007

25–29 0 0 .002 .067 .176 .287 .303 .224

30–34 0 0 0 .005 .077 .200 .387 .545

35–40 0 0 0 0 .002 .022 .073 .224

ρ = 0.85

0–4 .470 .168 .042 .003 0 0 0 0

5–9 .474 .437 .224 .069 .003 0 0 0

10–14 .056 .342 .375 .226 .075 .003 0 0

15–19 0 .053 .308 .404 .287 .106 .001 0

20–24 0 0 .050 .251 .383 .325 .140 0

25–29 0 0 0 .047 .194 .321 .328 .145

30–34 0 0 0 0 .058 .229 .447 .578

35–40 0 0 0 0 0 .017 .084 .277

ρ = 0.90

0–4 .530 .169 .020 0 0 0 0 0

5–9 .450 .501 .219 .033 0 0 0 0

10–14 .019 .312 .457 .207 .037 0 0 0

15–19 0 .018 .286 .489 .275 .050 0 0

20–24 0 0 .017 .252 .450 .318 .062 0

25–29 0 0 0 .018 .214 .367 .325 .046

30–34 0 0 0 0 .024 .257 .528 .587

35–40 0 0 0 0 0 .008 .085 .367

7 Professional standards indicate that the CSEM is an important piece

of information that should be provided in a test manual. For example,

Standard 2.14 ‘‘Conditional standard error of measurements should be

reported at several score levels if constancy cannot be assumed.

Where cut scores are specified for selection or classification, the

standard errors of measurement should be reported in the vicinity of

each cut score.’’ (American Educational Research Association/

American Psychological Association, 1999; p. 35 emphasis added).

evaluate the true level of agreement of diagnosis likely to

apply around a cut-off it is necessary to take the CSEM into

account.

Item Response Theory indicates that the error of measurement

varies with location on the trait (θ).

IRT gives

\[ \mathrm{SE}(\theta) = \frac{1}{\sqrt{I(\theta)}} \]

where I(θ) is the information at θ.

CTT gives

\[ \mathrm{SEM} = \mathrm{SD}\sqrt{1 - \rho} \]

Let ρ1 be the correlation at location 1 (θ1), ρ2 be the

correlation at location 2 (θ2).

Then

\[ \rho_2 = 1 - (1 - \rho_1)\,\frac{I(\theta_1)}{I(\theta_2)} \]

Location 1 is θ = 0.0 (PCL-R = 20) and Location 2 is

θ = 1.0 (PCL-R = 30) (Approximate locations from Hare,

2003; Fig. 6.6; see also Cooke & Michie, 1997). Overall,

the impact of the location of the estimated ICC1 is limited,

dropping—at a maximum—from .75 to .69. However, as

noted above, even small drops in ICC1 (e.g., from .85 to

.80) can substantially affect the misclassification rate and

the range of likely score discrepancies (see Table 6). It is

noteworthy that the magnitude of the drop appears to be

proportionately larger the poorer the mean estimated level

of inter-rater reliability. This suggests that the effect of the

CSEM is larger in cases that start with a relatively poor

level of inter-rater reliability. Equally, this would suggest

that proportionately greater discrepancies would, in

general, be obtained when factor or facet scores are

considered because they have lower levels of reliability

than the total scores (Hare, 2003).
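
The relation between ρ1 and ρ2 given above can be applied numerically. In this sketch the information values are assumptions: the ratio I(θ1)/I(θ2) = 1.24 is back-calculated from the drop from .75 to .69 reported in the text, not read from the article's information curve, and the function name is illustrative.

```python
def reliability_at_location(rho_1, info_1, info_2):
    """rho_2 = 1 - (1 - rho_1) * I(theta_1) / I(theta_2)."""
    return 1.0 - (1.0 - rho_1) * info_1 / info_2

# Assumed information values in arbitrary units (ratio 1.24, see lead-in).
INFO_AT_MEAN = 1.24     # theta = 0.0, PCL-R = 20
INFO_AT_CUTOFF = 1.00   # theta = 1.0, PCL-R = 30

for rho_1 in (0.75, 0.80, 0.85, 0.90):
    rho_2 = reliability_at_location(rho_1, INFO_AT_MEAN, INFO_AT_CUTOFF)
    relative_drop = (rho_1 - rho_2) / rho_1
    print(rho_1, round(rho_2, 3), round(relative_drop, 3))
```

With a fixed information ratio the absolute drop is proportional to 1 − ρ1, so the relative drop is largest when the starting reliability is poorest, which is the pattern the paragraph above describes.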

STUDY TWO

The use of the PCL-R in court is frequently justified based

on its predictive utility, the support being garnered from

between-subject designs (Edens & Petrila, 2006; Hare,

2003; Walsh & Walsh, 2006). In this study, we are concerned

with the individual. We examine the confidence that

can be placed in a prediction that an individual with a

particular PCL-R score will be reconvicted for a violent

offence.

All measurements and estimates entail error. As noted

above, the degree of error is expressed by CIs. For

Table 4 Cumulative distribution of expected discrepancies between two raters for different levels of correlation based on two sample distributions

Point                North American male offenders   United Kingdom prisoners
discrepancy  SEM^a   Correlation                     Correlation
                     0.75   0.80   0.85   0.90       0.75   0.80   0.85   0.90
0            1.00    1.00   1.00   1.00   1.00       1.00   1.00   1.00   1.00
1            .741    .934   .927   .919   .901       .936   .928   .918   .900
2            .503    .804   .785   .758   .704       .804   .785   .758   .703
3            .317    .679   .647   .595   .518       .681   .647   .595   .518
4            .303    .560   .516   .453   .352       .561   .518   .454   .357
5            .095    .449   .397   .327   .217       .452   .400   .332   .222
6            .046    .349   .292   .215   .121       .353   .298   .221   .125
7            .020    .262   .205   .136   .058       .269   .212   .141   .063
8            .007    .190   .137   .079   .023       .196   .142   .083   .026
9            .002    .132   .086   .041   .007       .137   .090   .044   .009
10           .001    .088   .050   .018   .002       .091   .054   .021   .003
11                   .055   .026   .007              .058   .030   .009   .001
12                   .032   .012   .002              .035   .015   .004
13                   .017   .005                     .019   .007   .001
14                   .008   .002                     .010   .003
15                   .003                            .005   .001
16                   .001                            .002
17                                                   .001

^a This column shows the cumulative distribution of discrepancies which was calculated assuming that discrepancies between two raters are normally distributed and that the SEM is 3 (Hare, 2003, pp. 66–67)


example, while the mean rate of reoffending for a ‘‘High

Risk’’ group may be estimated as being 55%; the 95% CI

indicates that the true value of the mean rate of reoffending

for this group will lie between 44% and 66%, 95% of the

time, i.e., 19 times out of 20 (Hart, Michie, & Cooke,

2007). However, the clinician and the decision maker are

interested in the individual case not the group. Therefore,

how much confidence can the clinician and decision maker

have in predictions of reoffending in the individual case

based on PCL-R scores? We examine CIs for group and

individual predictions.

Participants

Two hundred fifty-five male prisoners between 18 and

40 years of age (M = 26.8, SD = 5.9) were interviewed in

Scotland’s largest prison for a study of psychological

characteristics and violence (Cooke, Michie, & Ryan,

2001; Michie & Cooke, 2006). Prisoners were selected by

systematic random sampling of the prison. The average

sentence length was 39 months (SD = 23 months; range

= 3 months to 10 years and life).

PCL-R Ratings

PCL-R ratings were made according to instructions in the

test manual (Hare, 1991). All PCL-R evaluations were

conducted by trained raters using both interview and file

review (ICC1 = .86).

Assessment of Recidivism

Reconviction data were obtained from two sources: The

Scottish Criminal Records Office (SCRO) and the Police

National Computer (PNC). The average follow-up period

was 29 months. The point-biserial correlation between

PCL-R scores and recidivism (r = .31) was above average

for the field (Walters, 2003). For the purposes of illustration,

we consider reconviction for violence that resulted in

a prison sentence (i.e., generally a more serious violent

offence). Follow-up data were available for 190 cases and

PCL-R data for 184 of these.

Table 5 The 68% and 95% confidence intervals for 2nd PCL-R total score given 1st PCL-R score and ICC = 0.8 (LL = lower limit, UL = upper limit)

1st PCL-R   Prisoners                         UK
            LL .95  LL .68  UL .68  UL .95    LL .95  LL .68  UL .68  UL .95

0 0 0 9 12 0 0 8 13

1 0 0 10 13 0 0 9 14

2 0 1 11 14 0 0 10 15

3 0 2 12 15 0 1 11 16

4 0 3 12 15 0 1 12 16

5 0 4 13 16 0 2 12 17

6 0 4 14 17 0 3 13 18

7 0 5 15 18 1 4 14 19

8 1 6 16 19 2 5 15 20

9 2 7 16 19 2 5 16 20

10 3 8 17 20 3 6 16 21

11 4 8 18 21 4 7 17 22

12 4 9 19 22 5 8 18 23

13 5 10 20 23 6 9 19 24

14 6 11 20 23 6 9 20 24

15 7 12 21 24 7 10 21 25

16 8 12 22 25 8 11 21 26

17 8 13 23 26 9 12 22 27

18 9 14 24 27 10 13 23 28

19 10 15 24 27 10 13 24 28

20 11 16 25 28 11 14 24 29

21 12 16 26 29 12 15 25 30

22 12 17 27 30 13 16 26 31

23 13 18 28 31 14 17 27 32

24 14 19 28 31 14 17 28 32

25 15 20 29 32 15 18 28 33

26 16 20 30 33 16 19 29 34

27 16 21 31 34 17 20 30 35

28 17 22 32 35 18 21 31 36

29 18 23 32 35 18 21 32 36

30 19 24 33 36 19 22 32 37

31 20 24 34 37 20 23 33 38

32 20 25 35 38 21 24 34 39

33 21 26 36 38 22 25 35 40

34 22 27 36 39 22 25 36 40

35 23 28 37 40

36 24 28 38 40 24 27 37 40

37 24 29 39 40 25 28 38 40

38 25 30 40 40 26 29 39 40

39 26 31 40 40

40 27 32 40 40

Table 6 Values of conditional standard error of measurement at diagnostic cut-off of 30 for different values of SEM and distributions of the three samples

Sample                          SEM ρ₁   CSEM ρ₂
North American male prisoners   0.75     0.70
                                0.80     0.76
                                0.85     0.82
                                0.90     0.88
United Kingdom prisoners        0.75     0.67
                                0.80     0.74
                                0.85     0.80
                                0.90     0.87


Analysis

There are standard methods for estimating CIs for groups;

however, methods for estimating CIs for the individual

case are not generally covered in the standard statistical

texts used in psychology and they may, we suspect, be

unfamiliar to the majority of psychologists. We explicate

the method here. First, we consider the general case of CI

estimation before considering the specific approach based

on linear logistic regression used for our analysis.

Any CI has the general form:

\[ \text{Estimate} \pm t \times (\text{estimate of the standard error}) \]

where t is the Student’s t-statistic with the appropriate
degrees of freedom.

Suppose we are interested in a single variable, e.g.,
x = IQ, and have taken a sample of size n (x₁, x₂, ..., xₙ) to
estimate the mean and variance of IQ in the population of
interest, then the sample mean (x̄) is the estimate of the
population mean. The accuracy of this estimate is given by
a CI

\[ \bar{x} \pm t_n \sqrt{\frac{s^2}{n}} \]

where s is an estimate of the standard deviation of x.

Suppose we are now interested in predicting the next
observation in the population, x_{n+1}. Then a CI for the
prediction (i.e., the prediction interval) is given by

\[ \bar{x} \pm t_n \sqrt{s^2 \left( 1 + \frac{1}{n} \right)}. \]

Note that the estimate of both the mean and the

prediction is x but that the prediction interval is (much)

wider than the CI for the mean. Note also that the size of

the sample from which the model was derived has little

influence on the width of the prediction interval.
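The contrast between the two intervals can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions, not part of the original analysis: the standard deviation of 15 is a hypothetical IQ-like value, the function name is invented for illustration, and a normal quantile is used in place of Student's t, which is a close approximation only for larger samples.

```python
from math import sqrt
from statistics import NormalDist


def half_widths(s: float, n: int, level: float = 0.95) -> tuple[float, float]:
    """Return (CI half-width for the mean, prediction-interval half-width)."""
    # Assumption: a normal quantile stands in for Student's t.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    ci = z * sqrt(s**2 / n)                # shrinks as n grows
    pi = z * sqrt(s**2 * (1.0 + 1.0 / n))  # dominated by the constant 1
    return ci, pi


S = 15.0  # hypothetical standard deviation on an IQ-like scale
for n in (25, 100, 400, 10_000):
    ci, pi = half_widths(S, n)
    print(f"n = {n:>6}: mean CI +/- {ci:5.2f}, prediction interval +/- {pi:5.2f}")
```

As n increases, the half-width of the CI for the mean falls toward zero, whereas the prediction-interval half-width settles near z × s and barely changes.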

In the linear regression situation, we have a sample of n
pairs of observations ((x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)) from
which we estimate the intercept and slope of the line by B0
and B1 in the usual way. The accuracy of estimation of the
line would be given by the CI for the mean y for a given x.
This is calculated in the standard manner (Steel, Torrie, &
Dickey, 1997).

\[ y_L, y_U = B_0 + B_1 x \pm t_n \sqrt{\hat{\sigma}^2 \left( \frac{1}{n} + \frac{(x - \bar{x})^2}{SS(X)} \right)} \]

If we have a new case for which we know the x-value,
x_{n+1}, and wish to predict the y-value, this is given by

\[ \hat{y}_{n+1} = B_0 + B_1 x_{n+1} \]

which is the mean value of y for the given x. The CI for this
prediction (i.e., the prediction interval) is given by

\[ y_L, y_U = B_0 + B_1 x_{n+1} \pm t_n \sqrt{\hat{\sigma}^2 \left( 1 + \frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{SS(X)} \right)} \]

Again this prediction interval is much wider than the

interval for the line and is not influenced to any significant

degree by the size of the sample from which the model was

developed. The square root term is the standard error of the

predicted value. Here, the expression in brackets takes into

account three sources of error. The first is the variability in

participants, the second is the error in the estimate of
variance (σ̂²); and the third allows for the fact that the error

in prediction varies with distance from the mean PCL-R

score.
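To see why sample size matters so little, it may help to evaluate the three terms of the bracketed expression separately. The sketch below is illustrative: n = 184 and the mean score of 12.5 echo the Scottish sample, but the standard deviation of 7 used to build SS(X) is an assumption, and the function and key names are hypothetical.

```python
def bracket_terms(x_new: float, x_mean: float, n: int, ss_x: float) -> dict[str, float]:
    """Split 1 + 1/n + (x_new - x_mean)^2 / SS(X) into its three terms."""
    return {
        "participant_variability": 1.0,
        "estimation_error": 1.0 / n,
        "distance_from_mean": (x_new - x_mean) ** 2 / ss_x,
    }


# n and the mean echo the Scottish sample; the SD of 7 is an assumed value.
N, X_MEAN, ASSUMED_SD = 184, 12.5, 7.0
SS_X = (N - 1) * ASSUMED_SD**2

for score in (12.5, 25.0, 30.0):
    terms = bracket_terms(score, X_MEAN, N, SS_X)
    total = sum(terms.values())
    share = terms["participant_variability"] / total
    print(f"score {score:4.1f}: bracket = {total:.4f}, constant term share = {share:.1%}")
```

With these inputs the constant term accounts for nearly all of the bracket at every score, so enlarging n can only remove a small fraction of the prediction error.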

Linear logistic regression is the appropriate method for

modeling the prediction of a binary outcome (e.g.,

reconviction). In linear logistic regression the model is

given by

\[ \Pr(\text{event}) = \frac{1}{1 + e^{-Z}} \]

where

\[ Z = B_0 + B_1 x \]

We have a linear regression of Z on x so the equation for

the CI for Z is the same as the linear regression case. A

prediction interval for Z for a new individual from the same

population with score x0 can be constructed by ZL and ZU

(lower and upper values, respectively) from the equation

\[ Z_L, Z_U = B_0 + B_1 x_0 \pm t_n \sqrt{\hat{\sigma}^2 \left( 1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{SS(X)} \right)} \]

and the interval is then transformed in the following

manner. Since the probability is a monotonic function, the

prediction interval for the probability is given by

\[ \left( \frac{1}{1 + e^{-Z_L}},\ \frac{1}{1 + e^{-Z_U}} \right) \]

(note, once again, that the size of sample on which the

model was developed has little influence on the width of

the prediction interval).
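The transformation from the Z scale to the probability scale can be sketched as follows. The point estimate of .14 is the value reported for an average PCL-R score; the two half-widths on the Z scale are hypothetical, chosen only to contrast a narrow interval with a wide one, and the function names are illustrative.

```python
from math import exp, log


def logistic(z: float) -> float:
    """Pr(event) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + exp(-z))


def logit(p: float) -> float:
    """Inverse of the logistic function."""
    return log(p / (1.0 - p))


def probability_interval(z_hat: float, half_width: float) -> tuple[float, float]:
    """Map the interval z_hat +/- half_width on the Z scale to probabilities."""
    # The logistic function is monotonic, so the endpoints map directly.
    return logistic(z_hat - half_width), logistic(z_hat + half_width)


z_hat = logit(0.14)  # point estimate for an average PCL-R score
# Hypothetical half-widths on the Z scale; the paper does not report them.
for label, half_width in (("narrow interval", 0.4), ("wide interval", 5.0)):
    lower, upper = probability_interval(z_hat, half_width)
    print(f"{label}: {lower:.3f} to {upper:.3f}")
```

Because the logistic function flattens toward 0 and 1, a wide interval on the Z scale covers almost the whole probability range once transformed.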

Initially, we estimated the linear regression of Z on the

PCL-R total score (Fig. 2), and then estimated the mean

probability of reconviction by PCL-R score (Fig. 3). This is

a monotonic function with the probability of reconviction

accelerating with increasing PCL-R score. Those with an

average PCL-R score (i.e., 12.5) had a 14% probability of

being reconvicted for a violent crime and being sentenced

to prison. Examination of the 95% CIs for the estimate of

the mean rate of reconviction indicated that for an average

PCL-R score of 12.5 the true probability of reconviction

was between 10% and 20% (19 times out of 20) and for a

PCL-R score of 25 the 95% CI was 18–54%. For a score of


30, the 95% CI was 21–70% demonstrating that 95% CIs

generally widen the further scores are from the mean.

Fundamentally, however, within the clinical or judicial

context, the individual—not the group—is the focus of

decision making. Therefore, we estimated the CIs for the

likelihood that an individual would be reconvicted using

the method outlined above. For an individual with a mean

PCL-R score of 12.5, the best estimate was that he will

reoffend 14% of the time; but this estimate was very

imprecise because the 95% CI was 0–98%, i.e., the true

value of the prediction would lie in this range 95% of the

time. For an individual with a score of 25, the 95% CI was

0–99% and for an individual with a score of 30 the 95% CI

was 0–99.5%.

To illustrate the extent of the uncertainty associated with

an individual prediction, we calculated the probability

density function associated with a prediction that an individual

with a PCL-R score of 25 would return to prison

within 2 years for a violent crime (point estimate .33). The

probability function is a means of describing degree of

uncertainty. It can be viewed as a smoothed version of a

histogram depicting relative frequencies of the range of

probabilities of reoffending consequent on the variability in

the original sample. Figure 4 displays a relatively flat

probability density function around the point estimate of

.33, with values ranging from 0 to 1.0, indicating that a

broad range of values is likely in any individual case.

One anonymous reviewer made the compelling case that

a more liberal definition of harmful behavior that included

more forms of offending should be considered. We carried

out the same analyses with three other outcome variables,

i.e., convictions for any crime or offence within 2 years of

release; any convictions leading to incarceration and any

conviction for a violent crime over the same time period.

Figure 5 displays results for any convictions within 2 years

of release and reveals the same pattern as for violent crime:

The CI associated with the regression line is much

narrower than the prediction interval.8

Another anonymous reviewer suggested that our results

might be due to sample size (this is highly unlikely given

the mathematical basis for the analysis, see above) or

because of where the sample was drawn. We carried out

further analyses to clarify these points using data from the

MacArthur study of Mental Disorder and Violence (Monahan

et al., 2001). Psychopathy as measured using the

Psychopathy Checklist: Screening Version (PCL:SV; Hart,

Cox, & Hare, 1995) was the strongest risk factor for future

violence in that study, i.e., violence in the 20 weeks following

discharge; the sample size with PCL:SV ratings was

over four times the Scottish sample (n = 860). The point-biserial
correlation between PCL:SV scores and recidivism
was similar to the equivalent correlation in the Scottish
sample (r = 0.34, cf. r = 0.31). Figure 6 displays essentially
the same pattern as the Scottish data: The curve is
similar in shape to the Scottish curve, slowly accelerating
with risk of violence increasing with PCL:SV score.
As would be expected from consideration of the equations
above, increasing the sample size (among other things) has
resulted in a narrower CI around the regression line. But

critically—as would be expected from the mathematics—

increasing the sample size does not result in a narrower

prediction interval.

Perhaps a nonpsychological example may facilitate

explanation. Given someone’s height, how well can we

predict his or her weight? This example has several

advantages for the purpose of illustrating the pervasive

nature of the problem of predicting in the individual case.

First, the reliability of the measurement of height and

[Fig. 2 Group and individual CIs for linear regression of Z on PCL-R Total Score. Axes: PCL Total Score (0–40) against Z (−9 to 9); lines: Z, mean, prediction.]

[Fig. 3 Group and individual CIs around the prediction of violent reoffending resulting in return to prison based on PCL-R score. Axes: PCL-R total score (0–40) against probability of recidivism (0–1); lines: probability, mean, prediction.]

8 Detailed descriptions of the results for the additional three outcome

variables can be obtained from the first author.


weight should be substantially higher than the measurement

of either psychopathy or violent behavior. Second,

the prediction is immediate and not degraded by the passage

of time. Third, the relationship between height and

weight is stronger than that between psychopathy and

violent behavior. How well can we predict in the individual

case under these more benign conditions? We carried out a

Monte Carlo simulation9 based on two sets of findings: The

height of males in the UK is normally distributed (Guilford,

Rona, & Chinn, 1992); and the relationship between height

and weight can be assumed to be linear (Hawthorne,

Murdoch, & Womersley, 1979). Figure 7 presents the linear

regression of weight on height with a sample of 2000.

The CI of the regression line is very narrow; however,

when the prediction interval is calculated for an individual

it is very wide. For example, for an individual of average

height (i.e., 1.75 m) his predicted weight would be 81.5 kg

but the prediction interval is between 61.3 and 101.8 kg—a

range of around 40 kg.
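The height and weight illustration can be re-created in outline with the Python standard library. The sketch below is not the authors' Mathcad code: it uses the parameters given in footnote 9 (mean height 1.75 m with SD 0.07, Weight = 82.7 × Height − 63.4, n = 2,000), treats 98.4 as the error variance, fixes an arbitrary random seed, and replaces Student's t with a normal quantile. All of these choices are assumptions, so the output will differ slightly from the figures quoted above.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def simulate(n: int = 2000, seed: int = 0) -> tuple[list[float], list[float]]:
    """Generate (heights, weights) using the footnote 9 parameters."""
    rng = random.Random(seed)
    heights = [rng.gauss(1.75, 0.07) for _ in range(n)]
    # Weight = 82.7 * Height - 63.4 plus noise; 98.4 is treated as a variance.
    weights = [82.7 * h - 63.4 + rng.gauss(0.0, sqrt(98.4)) for h in heights]
    return heights, weights


def intervals_at(x0, xs, ys, level=0.95):
    """Return (prediction, CI half-width, prediction-interval half-width) at x0."""
    n = len(xs)
    x_bar, y_bar = fmean(xs), fmean(ys)
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ss_x
    b0 = y_bar - b1 * x_bar
    resid_var = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2)
    # Assumption: a normal quantile replaces Student's t (close for n = 2000).
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    core = 1.0 / n + (x0 - x_bar) ** 2 / ss_x
    ci = z * sqrt(resid_var * core)
    pi = z * sqrt(resid_var * (1.0 + core))
    return b0 + b1 * x0, ci, pi


heights, weights = simulate()
pred, ci, pi = intervals_at(1.75, heights, weights)
print(f"predicted weight at 1.75 m: {pred:.1f} kg")
print(f"CI for the line: +/- {ci:.1f} kg; prediction interval: +/- {pi:.1f} kg")
```

Even with 2,000 simulated cases, the interval for the regression line is a fraction of a kilogram wide while the interval for an individual spans tens of kilograms.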

In conclusion, the results demonstrate that PCL-R (and

PCL:SV) scores provide little reliable information about the likelihood that an individual will reoffend violently.10

This is not a problem peculiar to the PCL-R but will reflect

individual variability on any scale (e.g., VRAG, Quinsey

et al., 1998; Static-99, Hanson & Thornton, 1999; COVR,

Monahan et al., 2005).

DISCUSSION

One broad conclusion can be drawn from these two studies:

Clinicians must be extremely cautious in what they claim

[Fig. 5 Group and individual CIs around the prediction of any reoffending based on PCL-R score. Axes: PCL-R total score (0–40) against probability of reconviction (0–1); lines: mean, prediction, probability.]

[Fig. 6 Group and individual CIs around the prediction of violence in the 20 weeks following discharge: Data from MacArthur Study of Mental Disorder and Violence. Axes: PCL:SV (0–20) against probability (0.0–1.0); lines: probability of violence, confidence interval for line, confidence interval for prediction.]

[Fig. 4 Probability density function of the probability of a return to prison on conviction of a violent crime for a PCL-R score of 25. Axes: probability of recidivism (0–1) against probability density (0–0.15).]

9 Following Guilford et al. (1992), the height of adult males in the

UK between 1973 and 1988 was shown to be normally distributed

with a mean of approximately 1.75 m (SD = 0.07). The relationship

between weight and height for men aged 40–59 can be shown to be

linear with the relationship for nonsmokers being Weight = 82.7 ×
Height − 63.4 with r = 0.50 (Hawthorne et al., 1979). A sample of

2,000 pairs of height and weight were generated using Mathcad. For

each subject, a height H was generated from a N(1.75, 0.49)

distribution. A predicted weight was then calculated; a weight W was

generated by adding an error from a N(0,98.4) distribution. The

correlation between height and weight for this sample of 2,000 cases

did not differ significantly from that reported in Hawthorne et al.

(1979) (0.50 vs. 0.499). The linear regression of weight on height

together with the CI for the regression line and the prediction interval

for an individual whose height was known were calculated and

presented in Fig. 7. Linear regression rather than logistic regression

was used because both variables are continuous. The basic distinction

between confidence intervals and prediction intervals remains the

same.

10 We do not consider additional issues that would add ‘‘noise’’ into

the system including recalibration of the PCL-R in a new jurisdiction

in terms of the metric equivalence—or otherwise—of the scores

(Cooke et al., 2005), the differences of reliability in clinical practice

against research settings, and variations in the predictive validity of

the PCL-R in a setting where detection and conviction rates may be

different, etc.


regarding diagnoses, numerical scores, and risk potential of

individual clients based merely on a PCL-R score. First,

allocation above and below key diagnostic cut-offs (i.e., 30

or 25 on the PCL-R) is subject to far greater variability than

previously demonstrated. Second, the precision of numerical

scores is less than previously considered. Third, the

clinician can have little confidence in statistical predictions

regarding an individual’s likelihood of future offending

based on a PCL-R score or the scores of violence risk

assessment instruments. Fourth, the concatenation of these

two sources of imprecision—score and predictive—is

likely to further intensify uncertainty about what any one

individual will do in the future.

We emphasize again that these problems are not unique

to the PCL-R: The shape of underlying score distributions

will influence the precision of any scores estimated or any

diagnoses derived. Statistical predictions about individuals

will always be poor (Hart et al., 2007). As noted above, all

psychological tests used in the same way in the forensic

arena may suffer from similar limitations (e.g., VRAG,

Quinsey et al., 1998; Static-99, Hanson & Thornton, 1999;

COVR, Monahan et al., 2005).

Neither are these problems unique to psychology. They

bedevil—as our height and weight example demonstrates—

any attempts to use group data to predict individual outcomes

accurately, whether the outcome is, for example,

heart attacks, cancer, juvenile delinquency, or recidivism

(Copas & Marshall, 1998; Elmore & Fletcher, 2006; Rose,

1992; Scott, 2003). The problem reflects inherent human

variation.

There are perhaps two broad findings to note when it

comes to considering the precision of our estimates of trait

strength (or indeed, diagnosis). First, the use of aggregate

statistics (e.g., Kappa or ICC1) to measure agreement, or to

infer precision of our measurement processes, can obscure

clinically important imprecision at the level of the individual.

Second, untested assumptions (e.g., that scores are

normally distributed) can be misleading when it comes to

estimating the precision of our estimates. The findings from

the Monte Carlo study (Study 1, described above) whether

expressed in terms of diagnostic agreement, score disagreement,

or range of score discrepancies may provide

some explanation for the growing evidence of clinically

significant discrepancies in PCL-R ratings in forensic settings

(Boccaccini, Turner, & Murrie, 2008; Edens &

Petrila, 2006; Murrie et al., 2008; Murrie, Boccaccini,

Turner, et al., in press).

Ethical forensic practice requires practitioners to maximize

their reliability. There are no panaceas but four steps

may assist. The first step is ongoing education and training,

not only regarding the research base of tests and measures

used in forensic practice, but also regarding advanced

clinical skills. Advanced clinical skills would include

techniques for interviewing these challenging individuals to

ensure the collection of relevant information; these

skills would also include techniques for generating case

formulations to ensure the appropriate application of the

information collected (Cooke, 2008, 2009a; Logan &

Johnstone, 2008). The second step is ensuring the availability

of comprehensive file information. The quality of file

information influences both the magnitude and reliability of

scores (Alterman, Cacciola, & Rutherford, 1993). The third

step is the use of multiple raters in high-stakes cases: average
ratings should be eschewed and consensus ratings should be

sought. The fourth step is the implementation of audit systems—

including peer review—for the detection of rater

drift (Cooke, 2009b).

Deriving Inferences About Individuals

from Inferences About Groups

We recognize that some of our conclusions may be surprising—

perhaps even controversial—as there is a

widespread acceptance of the prediction paradigm. However,

should we be surprised that we find it difficult to

predict what any individual will do in the future? Consider

just some of the factors that affect predictive accuracy: The

lack of reliability in the predictor and outcome variables;

the relative weakness of the association between these

variables; the inherent variability across individuals—and

within individuals and their circumstances across time—

and the multitudinous causes that result in violent crime.

Perhaps we have become over-confident. Studies of judgment

under uncertainty have indicated human tendencies

both to be overconfident in predictions (Kahneman &

Tversky, 1973) and overly narrow in CI estimates (Alpert

& Raiffa, 1982). Professionals are not immune from these

biases.

[Fig. 7 Group and individual CIs around the prediction of weight from knowledge of an individual’s height: Simulation with n = 2,000. Axes: height (1.6–1.9 m) against weight (40–120 kg); lines: predicted weight, mean, prediction.]


The findings we present about predictions in individual

cases reflect a problem of inference that is long recognized

in psychology and other disciplines more generally (e.g.,

Altman & Royston, 2000; Henderson & Keiding, 2005).

Discussing child development, Lewin (1931; as cited in

Richters, 1997) noted ‘‘An inference from the average to

the particular case is … impossible.’’ (Richters, 1997, p.

199). Discussing the medical application of prognostic

models, Altman and Royston (2000) noted ‘‘…the distinction

between what is achievable at the group and

individual levels is not well understood’’ (p. 454). The

problem pertains even under ideal conditions: Henderson

and Keiding (2005), discussing survival time prediction in

relation to virulent non-small-cell lung cancer, indicated

‘‘…the intrinsic statistical variations in life times are so

large that predictions based on statistical models and

indices are of little use for individual patients. This applies

even when the prognostic model is known to be true and

there is no statistical uncertainty in parameter estimation’’

(p. 703). Why is this so?

Confidence Intervals and Prediction Intervals

As we indicated in our exegesis of the statistical principles
underlying this problem, the CIs for model parameters are
different from the CIs around the prediction for a new case;
the latter are always substantially wider than the former.

Also, prediction intervals are little influenced by the size of

the sample used to develop the statistical model (Steel

et al., 1997). Collecting bigger samples is not a solution.

We demonstrated this empirically by contrasting the

Scottish sample with the MacArthur sample.

The distinction between CIs and prediction intervals is

made in other areas of assessment, e.g., intelligence testing.

An example may demonstrate the pervasiveness of the

prediction problem when applied to the individual. The

Wechsler Abbreviated Scale of Intelligence (WASI; Psychological

Corporation, 1999) is a brief test of intellectual

functioning, which can be used to predict an individual’s

performance on the ‘‘gold standard’’ Wechsler Intelligence

Scale for Children—Third Edition (WISC-III; Wechsler,

1991). Note that these are very reliable tests, that they

measure within the same conceptual domain (using very

similar procedures), and that the correlation between the two

tests is very high (Full-scale IQ r = .87). An individual

assessed on the WASI with a Full Scale IQ of 70 (90% CI

66–76) will have a predicted WISC-III Full Scale IQ of 70

(90% Prediction Interval 62–87; Psychological Corporation,

1999). Thus, even in these ideal conditions—the same

conceptual domain, highly reliable tests that are highly

correlated—the prediction interval is 2.5 times greater than

the equivalent CI. It is not surprising that the difference

between the CI and the prediction interval is even greater

when the link between PCL-R scores and future violence is

considered. In the Scottish sample, for a mean PCL-R score

the best estimate of the probability of reoffending violently

is 14%, the CI is between 10% and 20%, whereas the prediction

interval is between 0% and 98%. In this case, the

prediction interval is almost ten times the CI.
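The two ratios quoted above follow directly from the interval widths. The brief Python check below uses only the intervals stated in the text; the helper name and the dictionary layout are illustrative.

```python
def width(interval: tuple[float, float]) -> float:
    """Width of a (lower, upper) interval."""
    lower, upper = interval
    return upper - lower


# Intervals quoted in the text (IQ points for the WASI, percent for the PCL-R).
examples = {
    "WASI -> WISC-III": {"ci": (66, 76), "prediction": (62, 87)},
    "PCL-R -> violent reconviction": {"ci": (10, 20), "prediction": (0, 98)},
}

for name, pair in examples.items():
    ratio = width(pair["prediction"]) / width(pair["ci"])
    print(f"{name}: prediction interval is {ratio:.1f} times the CI width")
```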

This problem of moving from the general to the specific

is not merely a matter of statistics; it is also a matter of

logic (Hájek & Hall, 2002; Hart et al., 2007). The application
of between-subject information to guide within-subject

causal inference is subject to the logical fallacy of

division (Rorer, 1990). One form of this fallacy rests on

drawing an invalid conclusion about an individual member

of a group based on the collective properties of the group.

For example, it is obviously fallacious to argue that if, in

general, intelligent people earn more than less intelligent

people then Jules, with an IQ of 120, will earn more than

Jim with an IQ of 100. Equally, it is fallacious to argue that
because, in general, people who score highly on the PCL-R
re-offend more than people who do not score highly, Bill
with a PCL-R score of 30 will re-offend more often than
Brian with a PCL-R score of 10. A common defense of the
actuarial approach is founded upon this fallacy: “If it is
alright for life insurance companies, it should be alright for
psychology.” The analogy is false. The actuary makes a

profit by predicting the proportion of insured lives that will

end in a particular time period: The actuary has no interest

in predicting the deaths of particular individuals.

There is a growing awareness in psychology that

between-subject models cannot test or support causal

accounts (e.g., pertaining to earning potential or violence)

that are valid at the individual level (Borsboom, Mellenbergh,
& van Heerden, 2003; Richters, 1997). With a
between-subjects design it is possible to argue legitimately
that, within a population, differences in psychopathy can cause
differences in violent reoffending.

However, this position cannot be defended at the level of the

individual; this is because there is an unspoken assumption

that the mechanisms that operate at the level of the individual

also explain variations between individuals. Richters

(1997) clarified the basis of the problem:

The extraordinary human capacity for equifinal and

multifinal functioning, however, render the structural

homogeneity assumption untenable. Very similar

patterns of overt functioning may be caused by

qualitatively differing underlying structures both

within the same individual at different points in time,

and across different individuals at the same time

(equifinality) (pp. 206–207).

Individuals are violent for different reasons: Any one

individual may be violent for different reasons on different

occasions.


In summary, on the basis of empirical findings, statistical

theory, and logic it is clear that predictions of future

offending cannot be achieved, with any degree of confidence,

in the individual case.

CONCLUSION

We emphasize again that the problems identified in this

article are not unique to the PCL-R. In some sense our

ability to demonstrate these problems with the PCL-R is a

reflection of the success of this test: It is used extensively

and thus large datasets are available; it has been subject to

considerable psychometric evaluation. Other tools used in

forensic settings will be subject to similar limitations. For

example, the precision with which individuals can be

allocated to risk ‘‘bins’’ by actuarial risk tools is influenced

by the reliability of scoring and the underlying distribution

of scores. Faigman (2007) argued that psychology has

ignored the problem of translating scientific research into

findings that help triers of fact; he indicates that psychology

has to take on the ‘‘monumental intellectual challenge’’

(p. 313) of making the inferential leap between population-level

findings and individual-level findings relevant to

courts. We ignore this challenge at our peril. Tentative

steps toward meeting this challenge are discussed elsewhere

(Cooke, 2009b).

This article is not without limitations. First, it is based on

males. We know little about the reliability of the diagnosis

and predictive utility in females or, indeed, whether the

instrument functions adequately in females or other populations

(Forouzan & Cooke, 2005; Verona & Vitale, 2006).

Second, the study is focused on adults. The potential for life-changing

decisions may be even greater when related procedures

are applied to adolescents; less information is

generally available to make a diagnosis in adolescents

(Edens & Petrila, 2006). The methods we used are explicated

in detail in this article so that others can apply them to

their own—hopefully diverse—datasets.

REFERENCES

Alpert, M., & Raiffa, H. (1982). A progress report on the training of

probability assessors. In D. Kahneman, P. Slovic, & A. Tversky

(Eds.), Judgment under uncertainty: Heuristics and biases (pp.

294–305). New York: Cambridge University Press.

Alterman, A. I., Cacciola, J. S., & Rutherford, M. J. (1993). Reliability

of the Revised Psychopathy Checklist in substance abuse patients.

Psychological Assessment, 5, 442–448. doi:10.1037/1040-3590.

5.4.442.

Altman, D. G., & Royston, P. (2000). What do we mean by validating

a prognostic model? Statistics in Medicine, 19, 453–473. doi:

10.1002/(SICI)1097-0258(20000229)19:4<453::AID-SIM350>

3.0.CO;2-5.

American Educational Research Association/American Psychological

Association. (1999). Standards for educational and psychological

testing. Washington, DC: American Educational Research

Association.

Boccaccini, M. T., Turner, D. B., & Murrie, D. C. (2008). Do some

evaluators report consistently higher or lower PCL-R scores than

others? Findings from a statewide sample of sexually violent

predator evaluations. Psychology, Public Policy, and Law, 14,

262–283. doi:10.1037/a0014523.

Borsboom, D., Mellenbergh, G. J., & van Heerden, J. (2003). The

theoretical status of latent variables. Psychological Review, 110,

203–219. doi:10.1037/0033-295X.110.2.203.

Bradfield, R. B., Huntzickler, P. B., & Fruehan, G. J. (1970). Errors of

group regression for prediction of individual energy expenditure.

The American Journal of Clinical Nutrition, 23, 1015–1016.

Colditz, G. A. (2001). Cancer culture; epidemics, human behavior,

and the dubious search for new risk factors. American Journal of

Public Health, 91, 357–359. doi:10.2105/AJPH.91.3.357.

Cooke, D. J. (2008). Psychopathy as an important forensic construct:

Past, present and future. In D. Canter & R. Zukauskiene (Eds.),

Psychology, crime & law. New horizons—International perspectives.

Aldershot: Ashgate.

Cooke, D. J. (2009a). Psychopathy. In E. A. Campbell & J. Brown

(Eds.), Cambridge handbook of forensic psychology. Cambridge:

Cambridge University Press.

Cooke, D. J. (2009b). Strengths and limitations of the Psychopathy

Checklist Revised (PCL-R) in courts and other tribunals (Paper

under preparation).

Cooke, D. J., & Michie, C. (1997). An Item Response Theory

evaluation of Hare’s Psychopathy Checklist. Psychological

Assessment, 9, 2–13. doi:10.1037/1040-3590.9.1.3.

Cooke, D. J., Michie, C., & Hart, S. D. (2006). Facets of clinical

psychopathy: Towards clearer measurement. In C. J. Patrick

(Ed.), Handbook of psychopathy (pp. 91–106). New York: The

Guilford Press.

Cooke, D. J., Michie, C., Hart, S. D., & Clark, D. (2005). Assessing

psychopathy in the United Kingdom: Concerns about cross-cultural

generalisability. The British Journal of Psychiatry, 186,

339–345. doi:10.1192/bjp.186.4.335.

Cooke, D. J., Michie, C., & Ryan, J. (2001). Evaluating risk for

violence: A preliminary study of the HCR-20, PCL-R and VRAG

in a Scottish prison sample. Edinburgh: Scotland Office.

Copas, J., & Marshall, P. (1998). The offender group reconviction

scale: A statistical reconviction score for use by probation

officers. Applied Statistics, 47, 159–171. doi:10.1111/1467-9876.00104.

DeMatteo, D., & Edens, J. F. (2006). The role and relevance of the

Psychopathy Checklist-Revised in court. A case law survey of

U.S. courts (1991–2004). Psychology, Public Policy, and Law,

12, 214–241. doi:10.1037/1076-8971.12.2.214.

Douglas, K. S., Vincent, G. M., & Edens, J. F. (2006). Risk for

criminal recidivism: The role of psychopathy. In C. J. Patrick

(Ed.), Handbook of psychopathy (pp. 533–554). New York: The

Guilford Press.

DSPD Programme. (2005). Dangerous and Severe Personality

Disorder (DSPD) High Secure Services for Men. London: DSPD

Programme, Department of Health, Home Office, HM Prison

Service.

Edens, J. F., & Petrila, J. (2006). Legal and ethical issues in the

assessment and treatment of psychopathy. In C. J. Patrick (Ed.),

Handbook of psychopathy (pp. 573–588). New York: The Guilford

Press.

Elmore, J. G., & Fletcher, S. W. (2006). The risk of cancer risk

prediction: “What is my risk of getting breast cancer?” Journal of

the National Cancer Institute, 98, 1673–1675.

272 Law Hum Behav (2010) 34:259–274


Faigman, D. L. (2007). The limits of science in the courtroom. In E.

Borgida & S. T. Fiske (Eds.), Beyond common sense: Psychological

science in the courtroom (pp. 303–313). Oxford:

Blackwell.

Fitch, W. L., & Ortega, R. J. (2000). Law and the confinement of

psychopaths. Behavioral Sciences & the Law, 18, 663–678.

doi:10.1002/1099-0798(200010)18:5<663::AID-BSL408>3.0.CO;2-V.

Fleiss, J. L. (1981). Statistical methods for rates and proportions.

New York: Wiley.

Forouzan, E., & Cooke, D. J. (2005). Figuring out la femme fatale:

Conceptual and assessment issues concerning psychopathy in

females. Behavioral Sciences and the Law, 23, 765–778.

Gail, M. H., & Benichou, J. (2000). Encyclopedia of epidemiological

methods. Chichester: Wiley.

Guilford, M. C., Rona, R. J., & Chinn, S. (1992). Trends in body mass

index in young adults in England and Scotland from 1973 to 1988.

Journal of Epidemiology and Community Health, 46, 187–190.

doi:10.1136/jech.46.3.187.

Hájek, A., & Hall, N. (2002). Induction and probability. In P. Machamer

& M. Silberstein (Eds.), Blackwell guide to the philosophy of

science (pp. 149–172). Oxford: Blackwell.

Hanson, R. K., & Thornton, D. M. (1999). Static 99: Improving

actuarial risk assessments for sex offenders. Ottawa: Public Works

and Government Services Canada.

Hare, R. D. (1991). The Hare Psychopathy Checklist—Revised (1st

ed.). Toronto: Multi-Health Systems.

Hare, R. D. (1993). Without conscience: The disturbing world of the

psychopaths among us (1st ed.). New York: Pocket Books.

Hare, R. D. (1998). The Hare PCL-R: Some issues concerning its use

and misuse. Legal and Criminological Psychology, 3, 101–119.

Hare, R. D. (2003). The Hare Psychopathy Checklist—Revised (2nd ed.).

Toronto: Multi-Health Systems.

Hart, S. D. (1998). The role of psychopathy in assessing risk for

violence: Conceptual and methodological issues. Legal and

Criminological Psychology, 3, 121–137.

Hart, S. D. (2001). Forensic issues. In W. J. Livesley (Ed.), Handbook

of personality disorders: Theory, research, and treatment (pp.

555–569). New York: The Guilford Press.

Hart, S. D., Cox, D. N., & Hare, R. D. (1995). The Hare Psychopathy

Checklist: Screening version (1st ed.). Toronto: Multi-Health

Systems.

Hart, S. D., & Hare, R. D. (1997). Psychopathy: Assessment and

association with criminal conduct. In D. M. Stoff, J. Breiling, &

J. D. Maser (Eds.), Handbook of antisocial behavior (pp. 22–35).

New York: Wiley.

Hart, S. D., Michie, C., & Cooke, D. J. (2007). The precision of actuarial

risk assessment instruments: Evaluating the “Margins of Error” of

group versus individual predictions of violence. The British

Journal of Psychiatry, 190(Suppl 49), 60–65. doi:10.1192/bjp.190.5.s60.

Hawthorne, V. M., Murdoch, R. M., & Womersley, J. (1979). Body

weight of men and women aged 40–64 years from an urban area

in the West of Scotland. Community Medicine, 1, 229–235.

Hemphill, J. F., Hare, R. D., & Wong, S. (1998). Psychopathy and

recidivism: A review. Legal and Criminological Psychology, 3,

139–170.

Hemphill, J. F., & Hart, S. D. (2002). Motivating the unmotivated:

Psychopathy, treatment, and change. In M. McMurran (Ed.),

Motivating offenders to change: A guide to enhancing engagement

in therapy (pp. 193–220). Chichester: Wiley.

Henderson, R., Jones, M., & Stare, J. (2001). Accuracy of point

predictions in survival analysis. Statistics in Medicine, 20, 3083–3096.

doi:10.1002/sim.913.

Henderson, R., & Keiding, N. (2005). Individual survival time

prediction using statistical models. Journal of Medical Ethics,

31, 703–706. doi:10.1136/jme.2005.012427.

Kahneman, D., & Tversky, A. (1973). On the psychology of prediction.

Psychological Review, 80, 237–251. doi:10.1037/h0034747.

Kennaway, R. (1998). Population statistics cannot be used for

reliable individual prediction. Retrieved October 12, 2006, from

http://citeseer.ist.psu.edu/328224.html.

Leistico, A. R., Salekin, R. T., DeCoster, J., & Rogers, R. (2008). A

large-scale meta-analysis relating the Hare measures of psychopathy

to antisocial conduct. Law and Human Behavior, 32,

28–45. doi:10.1007/s10979-007-9096-6.

Logan, C., & Johnstone, L. (2008). Personality disorders: Clinical and

risk formulations (Paper under review).

Lyon, D., & Ogloff, J. R. P. (2000). Legal and ethical issues in

psychopathy assessment. In C. B. Gacono (Ed.), The clinical and

forensic assessment of psychopathy (pp. 139–173). Mahwah, NJ:

Lawrence Erlbaum Associates.

Maden, A., & Tyrer, P. (2003). Dangerous and severe personality

disorders: A new personality concept from the United Kingdom.

Journal of Personality Disorders, 17, 489–496. doi:10.1521/pedi.17.6.489.25356.

MATHCAD.13. (2005). Mathcad 13 user’s guide. Cambridge, MA:

Mathsoft Engineering and Education, Inc.

Michie, C., & Cooke, D. J. (2006). The structure of violent behavior: A

hierarchical model. Criminal Justice and Behavior, 33, 706–737.

doi:10.1177/0093854806288941.

Monahan, J., Steadman, H., Robbins, P. C., Appelbaum, P., Banks, S.,

Grisso, T., et al. (2005). An actuarial model of violence. Psychiatric

Services, 56, 810–815. doi:10.1176/appi.ps.56.7.810.

Monahan, J., Steadman, H., Silver, E., Appelbaum, P., Robbins, P. C.,

Mulvey, E. P., et al. (2001). Rethinking risk assessment: The

MacArthur study of mental disorder and violence (1st ed.). New

York: Oxford University Press.

Mooney, C. Z. (1997). Monte Carlo simulation. Thousand Oaks, CA:

Sage.

Murrie, D. C., Boccaccini, M. T., Johnson, J. T., & Janke, C. (2008).

Does interrater (dis)agreement on Psychopathy Checklist scores in

Sexually Violent Predator trials suggest partisan allegiance in

forensic evaluations? Law and Human Behavior, 32(4), 352–362.

doi:10.1007/s10979-007-9097-5.

Murrie, D. C., Boccaccini, M. T., Turner, D., Meeks, M., Woods, C.,

& Tussey, C. (in press). Rater (dis)agreement on risk assessment measures

in sexually violent predator proceedings: Evidence of adversarial

allegiance in forensic evaluation. Psychology, Public Policy, and

Law.

Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd

ed.). New York: McGraw-Hill.

Psychological Corporation. (1999). Wechsler Abbreviated Scale of

Intelligence (WASI) manual. San Antonio, TX: Psychological

Corporation.

Quinsey, V. L., Harris, G. T., Rice, M. E., & Cormier, C. A. (1998).

Violent offenders: Appraising and managing risk (1st ed.).

Washington, DC: American Psychological Association.

Richters, J. E. (1997). The Hubble hypothesis and the developmentalist’s

dilemma. Development and Psychopathology, 9, 193–229.

doi:10.1017/S0954579497002022.

Robert, C. P. (2004). Monte Carlo statistical methods. New York:

Springer-Verlag.

Rockhill, B. (2001). The privatization of risk. American Journal of

Public Health, 91, 365–368. doi:10.2105/AJPH.91.3.365.

Rockhill, B., Kawachi, I., & Colditz, G. A. (2000). Individual risk

prediction and population-wide disease prevention. Epidemiologic

Reviews, 22, 176–180.

Rorer, L. (1990). Personality assessment: A conceptual survey. In L.

A. Pervin (Ed.), Handbook of personality: Theory and research

(pp. 693–720). New York: The Guilford Press.

Rose, G. (1992). The strategy of preventative medicine. Oxford:

Oxford Medical Publications.


Salekin, R. T., Rogers, R., & Sewell, K. W. (1996). A review and

meta-analysis of the Psychopathy Checklist and Psychopathy

Checklist-Revised: Predictive validity of dangerousness. Clinical

Psychology: Science and Practice, 3, 203–215.

Scott, K. G. (2003). Commentary: Individual risk prediction, individual

risk, and population risk. Journal of Clinical Child and Adolescent

Psychology, 32, 243–245. doi:10.1207/S15374424JCCP3202_9.

Steel, R. G. D., Torrie, J. H., & Dickey, D. A. (1997). Principles and

procedures of statistics: A biometrical approach. New York:

McGraw Hill.

Tam, C. C., & Lopman, B. A. (2003). Determinism versus stochasticism:

In support of long coffee breaks. Journal of Epidemiology and

Community Health, 57, 478. doi:10.1136/jech.57.7.477.

Verona, E., & Vitale, J. (2006). Psychopathy in women: Assessment,

manifestations and etiology. In C. J. Patrick (Ed.), Handbook of

psychopathy (pp. 415–436). New York: The Guilford Press.

Wald, N. J., Hackshaw, A. K., & Frost, C. D. (1999). When can a risk

factor be used as a worthwhile screening test? British Medical

Journal, 319, 1562–1565.

Walsh, T., & Walsh, Z. (2006). The evidentiary introduction of the

Psychopathy Checklist-Revised assessed psychopathy in U.S.

courts: Extent and appropriateness. Law and Human Behavior,

30, 493–507. doi:10.1007/s10979-006-9042-z.

Walters, G. D. (2003). Predicting criminal justice outcomes with the

Psychopathy Checklist and Lifestyle Criminality Screening

Form: A meta-analytic comparison. Behavioral Sciences & the

Law, 21, 89–102. doi:10.1002/bsl.519.

Wechsler, D. (1991). Wechsler Intelligence Scale for Children (3rd

ed.). San Antonio, TX: The Psychological Corporation.

Zinger, I., & Forth, A. E. (1998). Psychopathy and Canadian criminal

proceedings: The potential for human rights abuses. Canadian

Journal of Criminology, 40, 237–276.

