Journal of Clinical Epidemiology
Volume 63, Issue 8, Pages 938-939, August 2010

Use of Brier score to assess binary predictions

Biostatistics Unit, Institute of Social and Preventive Medicine, University of Zurich, Zurich, Switzerland

Published online 02 March 2010.


The Brier score [1] is increasingly popular in medical research for assessing and comparing the accuracy of binary predictions or prediction models; see, for example, Refs. [2], [3], [4], [5]. An overview of a variety of measures of model performance is offered in [6, Box 1], and the authors of [7] propose cutoffs for appraising the value of a computed score. How Brier scores can be formally compared is detailed in Ref. [8].

Because of a growing number of applications and in light of the description in [9, p. 1253], we would like to briefly discuss the Brier score and its connection to Spiegelhalter's calibration test [10].

For $n$ predictive probabilities $p = (p_1, \ldots, p_n)$ with $0 \le p_i \le 1$ and $n$ realizations $x = (x_1, \ldots, x_n)$ of Bernoulli random variables $X_i \sim \mathrm{Ber}(\pi_i)$ with $\pi_i \in [0, 1]$ and $x_i \in \{0, 1\}$, the Brier score, defined as

$$B(p, x) = \frac{1}{n}\sum_{i=1}^{n} (x_i - p_i)^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - p_i)(1 - 2p_i) + \frac{1}{n}\sum_{i=1}^{n} p_i(1 - p_i), \qquad (1)$$

equals the mean squared error of prediction. As a proper scoring rule, the Brier score simultaneously addresses calibration, that is, the statistical consistency between the predicted probabilities and the observations, and sharpness, which refers to the concentration of the predictive distribution; see Ref. [11]. This feature is also nicely illustrated by Murphy's decomposition of the Brier score; see Ref. [12]. When assessing the predictive accuracy of different binary predictions, for example from several logistic regression models, the Brier score can be used to compare model performance; see [6, Box 1]. The score is mainly a relative measure: a lower score points to a superior model, whereas the actual value of a single score seems to be of limited use.
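As a minimal sketch of the definition and of the two-term decomposition referred to as (1), the following Python code computes the Brier score as the mean of the squared differences (x_i - p_i)^2 and, separately, the two summands: the mean of (x_i - p_i)(1 - 2 p_i) and the mean of p_i (1 - p_i). The function names `brier_score` and `brier_terms` are illustrative choices, not part of any library, and the input values are arbitrary illustration data.

```python
from typing import Sequence, Tuple


def brier_score(p: Sequence[float], x: Sequence[int]) -> float:
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(p) != len(x) or len(p) == 0:
        raise ValueError("p and x must be non-empty and of equal length")
    return sum((xi - pi) ** 2 for pi, xi in zip(p, x)) / len(p)


def brier_terms(p: Sequence[float], x: Sequence[int]) -> Tuple[float, float]:
    """Return the two summands of the decomposition of the Brier score.

    first  = mean of (x_i - p_i) * (1 - 2 * p_i)
    second = mean of p_i * (1 - p_i)
    """
    n = len(p)
    first = sum((xi - pi) * (1 - 2 * pi) for pi, xi in zip(p, x)) / n
    second = sum(pi * (1 - pi) for pi in p) / n
    return first, second


# Illustration data (hypothetical): two predictions and two observed outcomes.
p = [0.2, 0.2]
x = [1, 0]

score = brier_score(p, x)
first, second = brier_terms(p, x)

# The two summands add up to the Brier score because x_i**2 == x_i for 0/1 outcomes.
assert abs(score - (first + second)) < 1e-12
print(f"Brier score = {score:.3f}, first summand = {first:.3f}, second summand = {second:.3f}")
```

The second summand depends only on the predictions, not on the outcomes, which is why only the first summand carries information about calibration.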

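In the decomposition (1), the first summand has expectation 0 under perfect calibration, that is, if $E(X_i) = p_i$ for all $i$. This is exploited in the construction of Spiegelhalter's z-statistic [10], [13], which enables formal assessment of the calibration of binary predictions. The z-statistic is defined as

$$Z(p, x) = \frac{\sum_{i=1}^{n} (x_i - p_i)(1 - 2p_i)}{\sqrt{\sum_{i=1}^{n} (1 - 2p_i)^2\, p_i (1 - p_i)}}. \qquad (2)$$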
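The standardization behind the z-statistic can be made explicit with a short derivation. It is a sketch that assumes the $X_i$ are independent (an assumption stated here for the derivation) and that the calibration hypothesis $E(X_i) = p_i$ holds, so that $\operatorname{Var}(X_i) = p_i(1 - p_i)$. The statistic is the sum $\sum_i (x_i - p_i)(1 - 2p_i)$ divided by the square root of the variance computed below, which is why it is approximately standard normal for large $n$.

```latex
\begin{align*}
E\big[(X_i - p_i)(1 - 2p_i)\big]
  &= (1 - 2p_i)\,\big(E(X_i) - p_i\big) = 0, \\
\operatorname{Var}\big[(X_i - p_i)(1 - 2p_i)\big]
  &= (1 - 2p_i)^2 \operatorname{Var}(X_i)
   = (1 - 2p_i)^2\, p_i (1 - p_i), \\
\operatorname{Var}\Big[\sum_{i=1}^{n} (X_i - p_i)(1 - 2p_i)\Big]
  &= \sum_{i=1}^{n} (1 - 2p_i)^2\, p_i (1 - p_i).
\end{align*}
```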

The null hypothesis of calibration, that is, $H_0\colon E(X_i) = p_i$ for all $i$, is rejected at the significance level $\alpha$ if $|Z(p, x)| > q_{1-\alpha/2}$, where $q_\alpha$ is the $\alpha$-quantile of the standard normal distribution. This short summary makes it clear that

calibration is neither equal to prediction error

nor does a lower Brier score necessarily indicate better calibration,

as suggested by [9, p. 1253]. Indeed, suppose two Bernoulli experiments are performed and two forecasters issue predictions p1=(0.2, 0.2) and p2=(0.4, 0.5), and x=(1, 0) materializes. The resulting Brier scores and values of Spiegelhalter's z-statistic for these two competing models are provided in Table 1.

Table 1. Value of Brier score and Spiegelhalter's z-statistic for the two models and the realization x=(1, 0)

Model   p            B(p, x)   Z(p, x)
1       (0.2, 0.2)   0.34      1.06
2       (0.4, 0.5)   0.30      1.22
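The entries of Table 1 can be recomputed with a few lines of Python. This is a minimal sketch using only the standard library: it takes the z-statistic to be the sum of (x_i - p_i)(1 - 2 p_i) divided by the square root of the sum of (1 - 2 p_i)^2 p_i (1 - p_i), and it obtains a two-sided p-value from the standard normal distribution via `math.erfc`. The function names are illustrative, not taken from any package.

```python
import math
from typing import Sequence


def brier_score(p: Sequence[float], x: Sequence[int]) -> float:
    """Mean squared difference between predictions and 0/1 outcomes."""
    return sum((xi - pi) ** 2 for pi, xi in zip(p, x)) / len(p)


def spiegelhalter_z(p: Sequence[float], x: Sequence[int]) -> float:
    """Standardized first summand of the Brier score decomposition."""
    numerator = sum((xi - pi) * (1 - 2 * pi) for pi, xi in zip(p, x))
    variance = sum((1 - 2 * pi) ** 2 * pi * (1 - pi) for pi in p)
    if variance == 0:
        # Happens, for example, when every prediction equals 0, 0.5, or 1.
        raise ValueError("z-statistic is undefined: variance term is zero")
    return numerator / math.sqrt(variance)


def two_sided_p_value(z: float) -> float:
    """P(|N(0,1)| > |z|), computed as erfc(|z| / sqrt(2))."""
    return math.erfc(abs(z) / math.sqrt(2))


x = [1, 0]                                  # realization from the example
models = {1: [0.2, 0.2], 2: [0.4, 0.5]}     # predictions of the two forecasters

results = {}
for label, p in models.items():
    z = spiegelhalter_z(p, x)
    results[label] = (brier_score(p, x), z, two_sided_p_value(z))
    b, z_val, p_val = results[label]
    print(f"model {label}: B = {b:.3f}, Z = {z_val:.2f}, two-sided p = {p_val:.2f}")
```

Under these assumptions the sketch reproduces the z-values of Table 1; the Brier score of the second model evaluates to 0.305, which the table reports as 0.30. Both absolute z-values lie below 1.96, so with only two observations neither model would be rejected at the 5% level; the example is meant to show the ordering of the two measures, not a significant result.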

According to the value of the Brier score, the second model is to be preferred over the first; however, because its z-statistic is larger in absolute value (1.22 versus 1.06), this second model is less well calibrated than the first. This simple example shows that a lower Brier score does not in general imply better model calibration. The reason is that the Brier score simultaneously addresses calibration and sharpness; see the discussion above. To address calibration exclusively, Spiegelhalter's z-test should be used. It is therefore unclear to us what purpose is served by testing the Brier score against the value 0, as indicated by [9, Table 2]. Instead, we hazard a guess that the authors actually intended to say that the models they flag in [9, Table 2] are not well calibrated. Poor calibration, however, corresponds not to a Brier score that is significantly different from 0 but to a Spiegelhalter z-statistic that is significantly different from 0.

To conclude, we advocate the use of the Brier score to assess the predictive accuracy of binary prediction models, and we also agree that calibration of such models is an important issue that should be addressed when comparing models via the Brier score. With this short note, we intend to clarify some aspects of using these tools.

References 

  1. Brier GW. Verification of forecasts expressed in terms of probability. Mon Weather Rev. 1950;78:1–3
  2. Itoh S, Ikeda M, Mori Y, Suzuki K, Sawaki A, Iwano S, et al. Lung: feasibility of a method for changing tube current during low-dose helical CT. Radiology. 2002;224:905–912
  3. Vergouwe Y, Steyerberg EW, Eijkemans MJ, Habbema JD. Substantial effective sample sizes were required for external validation studies of predictive logistic regression models. J Clin Epidemiol. 2005;58:475–483
  4. Huo D, Senie RT, Daly M, Buys SS, Cummings S, Ogutha J, et al. Prediction of BRCA mutations using the BRCAPRO model in clinic-based African American, Hispanic, and other minority families in the United States. J Clin Oncol. 2009;27:1184–1190
  5. Steyerberg EW. Clinical prediction models. New York, NY: Springer; 2009
  6. Harrison DA, Brady AR, Parry GJ, Carpenter JR, Rowan K. Recalibration of risk prediction models in a large multicenter cohort of admissions to adult, general critical care units in the United Kingdom. Crit Care Med. 2006;34:1378–1388
  7. Steyerberg EW, Harrell FE, Borsboom GJ, Eijkemans MJ, Vergouwe Y, Habbema JD. Internal validation of predictive models: efficiency of some procedures for logistic regression analysis. J Clin Epidemiol. 2001;54:774–781
  8. Redelmeier DA, Bloch DA, Hickam DH. Assessing predictive accuracy: how to compare Brier scores. J Clin Epidemiol. 1991;44:1141–1146
  9. Lix LM, Yogendran MS, Leslie WD, Shaw SY, Baumgartner R, Bowman C, et al. Using multiple data features improved the validity of osteoporosis case ascertainment from administrative databases. J Clin Epidemiol. 2008;61:1250–1260
  10. Spiegelhalter DJ. Probabilistic prediction in patient management and clinical trials. Stat Med. 1986;5:421–433
  11. Gneiting T, Raftery AE. Strictly proper scoring rules, prediction, and estimation. J Am Stat Assoc. 2007;102:359–378
  12. Murphy AH. Scalar and vector partitions of the probability score: Part I. Two-state situation. J Appl Meteorol. 1972;11:273–282
  13. StataCorp. STATA, reference A-F. Stata Corporation; 2003.

PII: S0895-4356(09)00363-1

doi:10.1016/j.jclinepi.2009.11.009

Refers to article:

  • Using multiple data features improved the validity of osteoporosis case ascertainment from administrative databases, 11 July 2008

    Lisa M. Lix, Marina S. Yogendran, William D. Leslie, Souradet Y. Shaw, Richard Baumgartner, Christopher Bowman, Colleen Metge, Abba Gumel, Janet Hux, Robert C. James
    Journal of Clinical Epidemiology December 2008 (Vol. 61, Issue 12, Pages 1250-1260)
