Pulling EMR and Cancer Registry Data to Construct an MBC Cohort

Using Natural Language Processing to Construct a Metastatic Breast Cancer Cohort from Linked Cancer Registry and Electronic Medical Records Data

Albee Y Ling, Allison W Kurian, Jennifer L Caswell-Jin, George W Sledge Jr, Nigam H Shah, Suzanne R Tamang

Introduction and Study Findings

This article in the Journal of the American Medical Informatics Association discusses the use of natural language processing (NLP) and machine learning techniques to construct a cohort of metastatic breast cancer patients using electronic medical records (EMR) linked with the California Cancer Registry (CCR). The goal was to address the lack of suitable population-based data resources for the study of distant recurrence among breast cancer patients in the United States. The researchers identified 1886 metastatic breast cancer patients, of whom 27.1% had de novo metastatic disease and 72.9% had recurrent metastatic disease. A regularized logistic regression model trained to classify recurrent metastatic breast cancer achieved an area under the receiver operating characteristic curve (AUC) of 0.917, with sensitivity of 0.861, specificity of 0.878, and accuracy of 0.870.
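For readers unfamiliar with the model and metric named above, the following is a minimal sketch of regularized logistic regression and a rank-based AUC calculation in plain Python. The source does not state which penalty was used, so an L2 penalty is assumed here. The toy features, the hyperparameters `lam`, `lr`, and `epochs`, and all function names are illustrative assumptions, not the study's features, settings, or code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Predicted probability of the positive class for one feature vector."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train_l2_logreg(X, y, lam=0.1, lr=0.1, epochs=1000):
    """Batch gradient descent on mean log-loss plus (lam / 2) * ||w||^2.

    The bias term b is left unpenalized, a common convention.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # Data gradient is averaged over patients; the L2 term adds lam * w.
        w = [wj - lr * (gw / n + lam * wj) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: [affirmed MBC mentions, distinct MBC terms] per patient.
X = [[5, 3], [4, 2], [3, 2], [0, 0], [1, 1], [2, 1]]
y = [1, 1, 1, 0, 0, 0]

w, b = train_l2_logreg(X, y)
scores = [predict_proba(w, b, x) for x in X]
toy_auc = auc(scores, y)
```

A larger `lam` shrinks the weights more strongly, trading some fit on the training data for stability when features are noisy.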

Methodology and Framework 

Using a semi-supervised machine learning framework, the researchers combined EMR and CCR data to enable population-based research on metastatic breast cancer (MBC). Their methodology allows for the retrospective case detection needed to identify distant recurrence among breast cancer patients and has the potential to support population-level surveillance research across California and nationally. The framework has three steps: a custom NLP tool extracts MBC terms from unstructured text, distant supervision assigns labels to patients, and logistic regression models are trained to classify MBC recurrence. The classifiers performed similarly with and without CCR features, indicating that unstructured data alone can be informative for identifying the recurrent MBC cohort in a population-based setting.
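The first two steps of the framework can be sketched roughly as follows. This is an illustration only: the term list, the negation cues, the `pos_min` threshold, and the function names are hypothetical and are not taken from the study, whose NLP tool and labeling rules are not detailed in this summary. The idea it shows is that heuristic rules over extracted terms can produce noisy patient labels without manual chart review.

```python
import re

# Hypothetical lexicon; the study's actual MBC term list is not given here.
MBC_TERMS = ["metastatic breast cancer", "metastasis", "metastases", "stage iv"]
TERM_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in MBC_TERMS) + r")\b",
    re.IGNORECASE,
)
# Hypothetical negation cues, checked only within the same sentence.
NEGATION_RE = re.compile(r"\b(?:no|without|negative for|denies)\b", re.IGNORECASE)

def count_affirmed_mentions(note):
    """Count MBC terms that are not preceded by a negation cue in their sentence."""
    count = 0
    for sentence in re.split(r"[.!?]\s*", note):
        for match in TERM_RE.finditer(sentence):
            prefix = sentence[:match.start()]
            if not NEGATION_RE.search(prefix):
                count += 1
    return count

def distant_label(notes, pos_min=2):
    """Noisy label: 1 if at least pos_min affirmed mentions, 0 if none, None if unclear."""
    total = sum(count_affirmed_mentions(n) for n in notes)
    if total >= pos_min:
        return 1
    if total == 0:
        return 0
    return None  # ambiguous patients are left unlabeled

# Invented example notes, keyed by a made-up patient ID.
patients = {
    "p1": ["Bone scan shows metastases to the spine. Stage IV disease confirmed."],
    "p2": ["No evidence of metastasis. Continue tamoxifen."],
    "p3": ["Possible metastasis, biopsy pending."],
}
labels = {pid: distant_label(notes) for pid, notes in patients.items()}
```

Patients labeled `None` are the ones a trained classifier would be most useful for, since the heuristic alone cannot decide them.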

Evaluation and Contribution 

The framework was evaluated on a test set of 146 manually reviewed patients, achieving an overall accuracy of 0.870, sensitivity of 0.861, and specificity of 0.878. The study's contributions include retrieving information from unstructured text, applying a semi-supervised machine learning technique for metastatic recurrence classification, and leveraging complementary data sources (EMR and CCR) to develop a framework for detecting MBC and enabling population-based studies. The researchers emphasize the importance of using both data resources to promote further outcomes research.
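As a worked check on how these three metrics relate, the sketch below computes them from a confusion matrix. The counts used (62 true positives, 10 false negatives, 65 true negatives, 9 false positives) are not quoted from the source. They are back-calculated here as one split of 146 patients that reproduces the reported rates after rounding, and should be read as an assumption.

```python
def classification_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),               # recall among true MBC recurrences
        "specificity": tn / (tn + fp),               # recall among non-recurrent patients
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }

# Assumed counts, inferred from the reported rates (not taken from the paper).
metrics = classification_metrics(tp=62, fn=10, tn=65, fp=9)
rounded = {name: round(value, 3) for name, value in metrics.items()}
```

Because the assumed positive and negative groups are nearly equal in size (72 and 74), accuracy falls close to the midpoint of sensitivity and specificity.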

Potential and Conclusion 

Despite some limitations, the framework has considerable potential to identify recurrent metastatic cancer patients and to provide insights into the characteristics and outcomes of this understudied population, and it can be readily adapted as more linked datasets become available. In conclusion, the study describes the development and implementation of a machine learning framework that can accurately detect MBC.