Use of Brier score to assess binary predictions
The use of the Brier score [1] in medical research to assess and compare the accuracy of binary predictions or prediction models is increasingly popular; see, for example, Refs. [2], [3], [4], [5]. Ref. [6, Box 1] offers an overview of a variety of measures of model performance, and the authors of [7] propose cutoffs for appraising the value of a computed score. How Brier scores can be formally compared is detailed in Ref. [8].
Because of a growing number of applications and in light of the description in [9, p. 1253], we would like to briefly discuss the Brier score and its connection to Spiegelhalter's calibration test [10].
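For $n$ predictive probabilities $p = (p_1, \ldots, p_n)$ with $0 \le p_i \le 1$ and $n$ realizations $x = (x_1, \ldots, x_n)$ of Bernoulli random variables $X_i \sim \mathrm{Ber}(\pi_i)$ with $\pi_i = P(X_i = 1)$ and $x_i \in \{0, 1\}$, the Brier score is defined, and can be decomposed, as

$$B(p, x) = \frac{1}{n}\sum_{i=1}^{n} (x_i - p_i)^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - p_i)(1 - 2p_i) + \frac{1}{n}\sum_{i=1}^{n} p_i(1 - p_i). \tag{1}$$

In the decomposition (1) the second summand depends on the predictions alone and reflects their sharpness, whereas the first summand has expectation 0 under perfect calibration, that is, if $\pi_i = p_i$ for all $i$. This is exploited in the construction of Spiegelhalter's z-statistic [10], [13] that enables formal assessment of calibration of binary predictions. The z-statistic is defined as

$$Z(p, x) = \frac{\sum_{i=1}^{n} (x_i - p_i)(1 - 2p_i)}{\sqrt{\sum_{i=1}^{n} (1 - 2p_i)^2\, p_i (1 - p_i)}}. \tag{2}$$

The null hypothesis of calibration, that is, $H_0\colon \pi_i = p_i$ for all $i$, is rejected at the significance level $\alpha$ if $|Z(p, x)| > q_{1-\alpha/2}$, where $q_\alpha$ is the $\alpha$-quantile of the standard normal distribution. This short summary makes it clear that a lower Brier score does not necessarily imply better calibration, contrary to what is suggested by [9, p. 1253]. Indeed, suppose two Bernoulli experiments are performed, two forecasters issue the predictions $p_1 = (0.2, 0.2)$ and $p_2 = (0.4, 0.5)$, and $x = (1, 0)$ materializes. The resulting Brier scores and values of Spiegelhalter's z-statistic for these two competing models are provided in Table 1.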
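The rejection rule of Spiegelhalter's test can be sketched in Python as follows. This is a minimal illustration, assuming a two-sided test against the standard normal distribution; the function names `reject_calibration` and `two_sided_p_value` are hypothetical helpers introduced here, not part of any cited software.

```python
from statistics import NormalDist


def reject_calibration(z: float, alpha: float = 0.05) -> bool:
    """Two-sided decision: reject the null of calibration if |z| > q_{1 - alpha/2}."""
    critical = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return abs(z) > critical


def two_sided_p_value(z: float) -> float:
    """Probability that a standard normal variable is at least |z| in absolute value."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


# z-values of the two forecasters in the example (assumed inputs)
for z in (1.06, 1.22):
    print(z, reject_calibration(z), round(two_sided_p_value(z), 3))
```

With only two observations the normal approximation is purely illustrative; neither z-value exceeds the 5% critical value, so the point of the example is the ordering of the statistics rather than a formal rejection.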
Table 1. Value of Brier score and Spiegelhalter's z-statistic for the two models and the realization x = (1, 0)
| Model | p | B(p, x) | Z(p, x) |
|---|---|---|---|
| 1 | (0.2, 0.2) | 0.34 | 1.06 |
| 2 | (0.4, 0.5) | 0.30 | 1.22 |
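The entries of Table 1 can be recomputed with a few lines of Python. The sketch below assumes the usual definitions, namely the Brier score as the mean of $(x_i - p_i)^2$ and Spiegelhalter's z as $\sum_i (x_i - p_i)(1 - 2p_i)$ divided by $\sqrt{\sum_i (1 - 2p_i)^2 p_i(1 - p_i)}$; the function names are illustrative only.

```python
from math import sqrt


def brier_score(p, x):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((xi - pi) ** 2 for pi, xi in zip(p, x)) / len(p)


def spiegelhalter_z(p, x):
    """Sum of (x_i - p_i)(1 - 2 p_i) divided by its standard deviation under calibration."""
    numerator = sum((xi - pi) * (1 - 2 * pi) for pi, xi in zip(p, x))
    variance = sum((1 - 2 * pi) ** 2 * pi * (1 - pi) for pi in p)
    return numerator / sqrt(variance)  # undefined if every p_i is 0, 0.5, or 1


x = (1, 0)
for label, p in (("Model 1", (0.2, 0.2)), ("Model 2", (0.4, 0.5))):
    print(label, f"B = {brier_score(p, x):.3f}", f"Z = {spiegelhalter_z(p, x):.2f}")
```

The unrounded Brier score of the second model is 0.305, which Table 1 reports as 0.30; the z-values agree with the table to two decimals.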
According to the value of the Brier score, the second model is to be preferred over the first; however, because $|Z(p_1, x)| < |Z(p_2, x)|$, this second model is less well calibrated than the first. This simple example reveals that it is not generally true that a lower Brier score implies better model calibration. The reason is that the Brier score simultaneously addresses calibration and sharpness; see the discussion above. To exclusively address calibration, Spiegelhalter's z-test should be used. To us, it is therefore unclear what the purpose is of testing the Brier score against the value 0, as indicated by [9, Table 2]. Instead, we hazard a guess that the authors actually intended to say that their models marked with a “∗” in [9, Table 2] are not well calibrated. However, this does not correspond to having a Brier score that is significantly different from 0, but rather to having a Spiegelhalter z-statistic that is significantly different from 0.
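To see how the two ingredients combine in the example, the Brier score can be split numerically into its two summands. The sketch below relies on the identity $(x_i - p_i)^2 = (x_i - p_i)(1 - 2p_i) + p_i(1 - p_i)$, which holds whenever $x_i \in \{0, 1\}$; the names `calibration_part` and `sharpness_part` are labels chosen here for illustration.

```python
def brier_components(p, x):
    """Return the two summands whose sum equals the Brier score for 0/1 outcomes."""
    n = len(p)
    calibration_part = sum((xi - pi) * (1 - 2 * pi) for pi, xi in zip(p, x)) / n
    sharpness_part = sum(pi * (1 - pi) for pi in p) / n  # depends on p only
    return calibration_part, sharpness_part


x = (1, 0)
for p in ((0.2, 0.2), (0.4, 0.5)):
    first, second = brier_components(p, x)
    brier = sum((xi - pi) ** 2 for pi, xi in zip(p, x)) / len(p)
    assert abs(first + second - brier) < 1e-12  # identity holds because x_i is 0 or 1
    print(p, round(first, 3), round(second, 3), round(brier, 3))
```

The first summand is 0.18 for the first forecaster and 0.06 for the second, while the second summand is 0.16 and 0.245, respectively. Spiegelhalter's statistic standardizes the first summand by its standard deviation under calibration, which is why the ordering of the z-values need not follow the ordering of the Brier scores.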
To conclude, we advocate the use of the Brier score to assess the predictive accuracy of binary prediction models, and we also agree that calibration of such models is an important issue that should be addressed when comparing models via the Brier score. With this short note, we intend to clarify some aspects of using these tools.
References
1. Verification of forecasts expressed in terms of probability. Mon Weather Rev. 1950;78:1–3.
2. Lung: feasibility of a method for changing tube current during low-dose helical CT. Radiology. 2002;224:905–912.
3. Substantial effective sample sizes were required for external validation studies of predictive logistic regression models. J Clin Epidemiol. 2005;58:475–483.
4. Prediction of BRCA mutations using the BRCAPRO model in clinic-based African American, Hispanic, and other minority families in the United States. J Clin Oncol. 2009;27:1184–1190.
5. Clinical prediction models. New York, NY: Springer; 2009.
6. Recalibration of risk prediction models in a large multicenter cohort of admissions to adult, general critical care units in the United Kingdom. Crit Care Med. 2006;34:1378–1388.
7. Internal validation of predictive models: efficiency of some procedures for logistic regression analysis. J Clin Epidemiol. 2001;54:774–781.
8. Assessing predictive accuracy: how to compare Brier scores. J Clin Epidemiol. 1991;44:1141–1146.
9. Using multiple data features improved the validity of osteoporosis case ascertainment from administrative databases. J Clin Epidemiol. 2008;61:1250–1260.
10. Probabilistic prediction in patient management and clinical trials. Stat Med. 1986;5:421–433.
11. Strictly proper scoring rules, prediction, and estimation. J Am Stat Assoc. 2007;102:359–378.
12. Scalar and vector partitions of the probability score: Part I. Two-state situation. J Appl Meteorol. 1972;11:273–282.
13. StataCorp. STATA, reference A-F. Stata Corporation; 2003.
PII: S0895-4356(09)00363-1
doi:10.1016/j.jclinepi.2009.11.009
© 2010 Elsevier Inc. All rights reserved.
Refers to article:
- Using multiple data features improved the validity of osteoporosis case ascertainment from administrative databases, 11 July 2008
