The expansive availability of digital health data has fueled a surge of data-driven models designed to guide and improve healthcare delivery. This paradigm shift is already decisively shaping medical diagnostics, drug discovery, and personalized medicine, with models approaching and sometimes even surpassing expert clinicians in certain specialties [1, 2]. However, emerging evidence suggests that many of these data-driven clinical decision support tools may be biased and may not benefit all populations equally [3-6]. To put it simply: as AI improves quality of care for some patients, others are left behind [7-10]. Minorities and historically disadvantaged groups in particular are at risk of suffering from unreliable model predictions, as we have seen, for example, in the case of COVID-19 [11, 12]. Standardized empirical evaluation studies are needed to keep biased AI models from perpetuating, or even aggravating, systemic health disparities through dangerous feedback loops. Given the inequalities that persist in our healthcare systems and society at large, we believe that a discussion of how AI may contribute to finally breaking the vicious cycle of discrimination affecting countless minorities is more timely than ever. In our article “Peeking into a black box - the fairness and generalizability of a MIMIC-III benchmarking model”, just published in Scientific Data, we showcase an empirical evaluation framework and discuss the pervasive challenges around bias and fairness in risk prediction models.
AI models are susceptible to bias because they learn from data reflecting an intrinsically unjust healthcare system [13, 14]. In the absence of regulatory oversight, AI could hence silently reinforce pre-existing biases [15, 16]. Such risks not only threaten the care of vulnerable patient groups but also corrode the public’s trust, critically hampering the successful adoption of clinical AI tools [17]. In addition to systematic fairness assessments, data sharing and transparency are therefore key to building trust, improving model quality and fostering a better understanding of potential biases so that they can be effectively mitigated. The Medical Information Mart for Intensive Care (MIMIC) is a prime example of such an effort. Its creation in 2011 constituted a paradigm change, as it was one of the first publicly available electronic health record databases. MIMIC is now widely seen as a reference standard; its broad use has spurred the development of thousands of AI models, and it also serves as an invaluable educational resource for training the next generation of AI developers.
While a wealth of efforts is underway to forge and highlight pathways towards a better understanding of model fairness and generalizability, a standard set of metrics for systematic, objective and comprehensive evaluation is still emerging. In particular, no simple, universal notion of fairness exists as of yet. Hence, we designed an extensive fairness and generalizability assessment framework to demonstrate how clinical decision support models developed on publicly accessible datasets can be validated and refined. With an iterative improvement process in mind, we studied a recently published MIMIC-III in-hospital mortality benchmark model that has received great attention in the scientific community. We started by replicating the study results, then tested the MIMIC-trained model on the Stanford Medicine Research Data Repository, and finally re-trained and tested the model again on this independent validation set. For all three case study settings, we ran our extensive fairness and generalizability assessment framework covering the three major classes of fairness definitions (anti-classification, classification parity and calibration) to characterize the risk of any undue bias towards specific demographic groups based on gender, ethnicity and insurance payer type as a socioeconomic proxy.
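To make two of these fairness definition classes more concrete, the minimal sketch below computes per-group error rates (classification parity) and a simple calibration-in-the-large summary from model outputs. The function name, the 0.5 decision threshold and the toy inputs are our own illustrative assumptions, not the evaluation pipeline used in the paper.

```python
import numpy as np

def group_fairness_report(y_true, y_prob, groups, threshold=0.5):
    """Per-group classification-parity and calibration summaries.

    Illustrative sketch only: compares true/false positive rates and
    mean predicted risk vs. observed event rate across groups.
    """
    report = {}
    y_pred = (y_prob >= threshold).astype(int)
    for g in np.unique(groups):
        m = groups == g
        tp = np.sum((y_pred == 1) & (y_true == 1) & m)
        fn = np.sum((y_pred == 0) & (y_true == 1) & m)
        fp = np.sum((y_pred == 1) & (y_true == 0) & m)
        tn = np.sum((y_pred == 0) & (y_true == 0) & m)
        report[g] = {
            # Classification parity: equal TPR/FPR across groups.
            "tpr": tp / (tp + fn) if (tp + fn) else float("nan"),
            "fpr": fp / (fp + tn) if (fp + tn) else float("nan"),
            # Calibration-in-the-large: predicted risk vs. observed rate.
            "mean_predicted": float(y_prob[m].mean()),
            "observed_rate": float(y_true[m].mean()),
        }
    return report

# Hypothetical toy data: group "a" has half the sensitivity of group "b".
y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0])
y_prob = np.array([0.9, 0.2, 0.1, 0.1, 0.9, 0.9, 0.1, 0.1])
groups = np.array(["a"] * 4 + ["b"] * 4)
report = group_fairness_report(y_true, y_prob, groups)
```

A gap in `tpr` between groups would flag a classification-parity violation, while a gap between `mean_predicted` and `observed_rate` within a group would flag miscalibration for that group.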
While there are many strengths to the studied benchmark model, we also identified several limitations related to class imbalance and fairness that require mitigation and transparent reporting. Specifically, we found three main problems to be addressed:
- The benchmark model suffers from a typical class imbalance problem: good overall model performance masks the fact that minority class instances, here patients who die in hospital, are only rarely classified correctly. In this specific case, only about one in four to five patients at high risk of dying during the hospital stay is actually identified as such by the AI tool.
- While the model is capable of generalizing to different data sources, its predictive performance is markedly lower for certain ethnic and socioeconomic minority groups.
- Model calibration studies reveal that, at identical predicted risk, patient comorbidity burden differs across socioeconomic groups.
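The class imbalance point above can be illustrated with a toy simulation. All numbers here are hypothetical, chosen only to mirror the roughly one-in-four-to-five sensitivity described, not actual MIMIC-III figures: a model that misses most deaths can still report a reassuring overall accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Assumed ~10% in-hospital mortality (the minority class).
y_true = (rng.random(n) < 0.10).astype(int)

# A classifier that flags only ~1 in 4 true positives, but is
# almost always right on the large majority (survivor) class.
y_pred = np.where(y_true == 1,
                  (rng.random(n) < 0.25).astype(int),
                  (rng.random(n) < 0.02).astype(int))

accuracy = (y_pred == y_true).mean()
recall = y_pred[y_true == 1].mean()  # sensitivity on the minority class
```

Despite an overall accuracy around 90%, roughly three out of four patients who die are never flagged, which is exactly the failure mode that aggregate metrics hide.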
These results highlight the extent of masked bias even in high-quality scientific work and the need for thorough fairness evaluations before publishing, re-using and, in particular, deploying AI-guided clinical decision support, given the potential long-lasting and severe negative repercussions that model bias may cause. In this specific case, our work shows that Black and socioeconomically vulnerable patients in particular would be at risk, though other minority groups may also be affected [3-6]. These findings are especially important because this model has already been used dozens of times to benchmark new modeling techniques without any prior performance or fairness checks being reported [20-25].
As the use of AI in healthcare continues to expand, an important aspect of the safe and equitable dissemination of AI-based clinical decision support is the thorough evaluation of the model and its downstream effects, a step that goes beyond predictive performance to further encompass bias and fairness evaluations. Correspondingly, regulatory agencies seek to initiate and establish real-world performance requirements and test beds for the development of decision-support models. As our work shows, pervasive challenges around bias and fairness in risk prediction models, as well as the widespread use of simplistic evaluation metrics, remain. The repercussions of such non-comprehensive evaluation frameworks are a safety concern for entire populations, where the most vulnerable will ultimately suffer the most. Hence, we caution against the imprudent use of benchmark models lacking fairness assessments and external validation in order to make true progress and build trust in the community.
1. Yu, K. H., Beam, A. L. & Kohane, I. S. (2018). Artificial intelligence in healthcare. Nature Biomedical Engineering, 2, 719–731. https://doi.org/10.1038/s41551-018-0305-z
2. Nagendran, M. et al. (2020). Artificial intelligence versus clinicians: systematic review of design, reporting standards, and claims of deep learning studies in medical imaging. BMJ, 368, m689. https://doi.org/10.1136/bmj.m689
3. Zou, J. & Schiebinger, L. (2018). AI can be sexist and racist - it's time to make it fair. Nature, 559(7714), 324–326. https://doi.org/10.1038/d41586-018-05707-8
4. Chen, I., Szolovits, P. & Ghassemi, M. (2019). Can AI help reduce disparities in general medical and mental health care? AMA Journal of Ethics, 21(2), E167–179. https://doi.org/10.1001/amajethics.2019.167
5. Chen, I., Johansson, F. D. & Sontag, D. (2018). Why is my classifier discriminatory? arXiv preprint arXiv:1805.12002.
6. Meng, C. et al. (2021). MIMIC-IF: interpretability and fairness evaluation of deep learning models on MIMIC-IV dataset. arXiv preprint arXiv:2102.06761.
7. Chen, I. et al. (2020). Ethical machine learning in healthcare. Annual Review of Biomedical Data Science, 4.
8. Rajkomar, A. et al. (2018). Ensuring fairness in machine learning to advance health equity. Annals of Internal Medicine, 169, 866–872. https://doi.org/10.7326/M18-1990
9. Obermeyer, Z., Powers, B., Vogeli, C. & Mullainathan, S. (2019). Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366, 447–453. https://doi.org/10.1126/science.aax2342
10. Petersen, C. et al. (2021). Recommendations for the safe, effective use of adaptive CDS in the US healthcare system: an AMIA position paper. Journal of the American Medical Informatics Association. https://doi.org/10.1093/jamia/ocaa319
11. Röösli, E., Rice, B. & Hernandez-Boussard, T. (2021). Bias at warp speed: how AI may contribute to the disparities gap in the time of COVID-19. Journal of the American Medical Informatics Association, 28(1), 190–192. https://doi.org/10.1093/jamia/ocaa210
12. Paulus, J. K. & Kent, D. M. (2020). Predictably unequal: understanding and addressing concerns that algorithmic clinical prediction may increase health disparities. npj Digital Medicine, 3(1), 1–8.
13. FitzGerald, C. & Hurst, S. (2017). Implicit bias in healthcare professionals: a systematic review. BMC Medical Ethics, 18, 19. https://doi.org/10.1186/s12910-017-0179-8
14. Vyas, D. A., Eisenstein, L. G. & Jones, D. S. (2020). Hidden in plain sight - reconsidering the use of race correction in clinical algorithms. New England Journal of Medicine, 383, 874–882.
15. Gianfrancesco, M. A., Tamang, S., Yazdany, J. & Schmajuk, G. (2018). Potential biases in machine learning algorithms using electronic health record data. JAMA Internal Medicine, 178(11), 1544–1547. https://doi.org/10.1001/jamainternmed.2018.3763
16. Challen, R. et al. (2019). Artificial intelligence, bias and clinical safety. BMJ Quality & Safety, 28, 231–237.
17. McCradden, M. D., Joshi, S., Mazwi, M. & Anderson, J. A. (2020). Ethical limitations of algorithmic fairness solutions in health care machine learning. The Lancet Digital Health, 2(5), e221–e223. https://doi.org/10.1016/S2589-7500(20)30065-0
18. Johnson, A. et al. (2016). MIMIC-III, a freely accessible critical care database. Scientific Data, 3, 160035. https://doi.org/10.1038/sdata.2016.35
19. Harutyunyan, H., Khachatrian, H., Kale, D. C. et al. (2019). Multitask learning and benchmarking with clinical time series data. Scientific Data, 6, 96. https://doi.org/10.1038/s41597-019-0103-9
20. Gupta, P., Malhotra, P., Vig, L. & Shroff, G. (2018). Using features from pre-trained TimeNet for clinical predictions. In Proceedings of the 3rd International Workshop on Knowledge Discovery in Healthcare Data at IJCAI-ECAI, 38–44, Stockholm, Sweden.
21. Gupta, P., Malhotra, P., Vig, L. & Shroff, G. (2018). Transfer learning for clinical time series analysis using recurrent neural networks. In Machine Learning for Medicine and Healthcare Workshop at ACM KDD 2018 Conference, London, United Kingdom.
22. Jin, M. et al. (2018). Improving hospital mortality prediction with medical named entities and multimodal learning. In Machine Learning for Health (ML4H) Workshop at NeurIPS, Montreal, Canada.
23. Oh, J., Wang, J. & Wiens, J. (2018). Learning to exploit invariances in clinical time-series data using sequence transformer networks. In Proceedings of the 3rd Machine Learning for Healthcare Conference, vol. 85, 332–347, PMLR.
24. Malone, B., Garcia-Duran, A. & Niepert, M. (2018). Learning representations of missing data for predicting patient outcomes. Preprint at https://arxiv.org/abs/1811.04752.
25. Chang, C.-H., Mai, M. & Goldenberg, A. (2018). Dynamic measurement scheduling for adverse event forecasting using deep RL. In Machine Learning for Health (ML4H) Workshop at NeurIPS, Montreal, Canada.