The quality of training data for mammographic AI systems: investigating factors to reduce hype and increase clinical translation


  • Zhengqiang Jiang The University of Sydney
  • Phuong D. Trieu The University of Sydney
  • Ziba Gandomkar The University of Sydney
  • Seyedamir Tavakoli Taba The University of Sydney
  • Melissa L. Barron The University of Sydney
  • Sarah J. Lewis The University of Sydney


Background: Breast cancer is one of the leading causes of female cancer-related death, accounting for 25-30% of new cancer cases in women annually, and AI-based breast screening has demonstrated promising detection results. High-quality digital health data are required to train a reliable and effective AI model for breast cancer; however, radiologists differ in their experience with lesion segmentation and in their levels of agreement on lesion location. Aims: This study aims to evaluate AI systems by analyzing the quality of training data according to the concordance between radiologists' annotations.

Methods: The MIAS Mammography Dataset (320 images, 33% cancer cases, one radiologist annotating cancers) and the Lifepool dataset (856 images, two radiologists) were used. Overlapping annotations from the two radiologists were assessed with Lin's concordance correlation coefficient (CCC), and cases in the 'almost perfect' category were selected. An AI system applying deep convolutional neural networks to detect breast cancer was implemented.
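As an illustration of the concordance measure used here, Lin's CCC between two raters' measurements can be sketched as below. The category cut-offs follow one common descriptive scale (McBride, 2005) for the 'almost perfect', 'substantial', 'moderate' and 'poor' labels; the exact thresholds used in this study are not stated in the abstract, so the function names and boundaries are illustrative only.

```python
import numpy as np

def lins_ccc(x, y):
    """Lin's concordance correlation coefficient between two raters' scores."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()           # population variances
    sxy = ((x - mx) * (y - my)).mean()  # population covariance
    return 2 * sxy / (vx + vy + (mx - my) ** 2)

def agreement_category(ccc):
    """Descriptive strength-of-agreement scale (McBride, 2005; illustrative)."""
    if ccc > 0.99:
        return "almost perfect"
    if ccc > 0.95:
        return "substantial"
    if ccc > 0.90:
        return "moderate"
    return "poor"
```

Unlike the Pearson correlation, the CCC penalises both location shift (different means) and scale shift (different variances) between raters, so two annotators who rank lesions identically but disagree on absolute extent still score below 1.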

Results: Validation accuracy when the AI system was trained and tested on the combination of the MIAS and 'almost perfect' Lifepool datasets was higher (0.95 versus 0.90 for training; 0.97 versus 0.90 for testing). In addition, training on the combined datasets reached higher validation accuracy at an earlier epoch, indicating less computation time to train an accurate AI system.

Conclusions: AI system performance is affected by the concordance within the training data, with 'almost perfect' data yielding more accurate cancer detection and reduced training time. Future work will investigate training data at the 'substantial', 'moderate' and 'poor' concordance levels to map the effect of annotation noise on AI performance.





Oral Presentations