Implementing Hospital-Wide NLP

Hospital-wide natural language processing summarising the health data of 1 million patients

Daniel Bean, Zeljko Kraljevic, Anthony Shek, James Teo, Richard Dobson

Introduction and Methodology 

The article describes the application of natural language processing (NLP) algorithms to extract information from clinical text in electronic health records (EHRs) for over one million patients at a large London hospital trust. The authors applied open-source named-entity-recognition and linkage (NER+L) methods to 9.5 million EHR documents, generating 157 million SNOMED concept annotations that capture disease prevalence, onset, and major comorbidity patterns. The study demonstrates NLP's potential to transform the health data lifecycle through large-scale automation of a traditionally manual task.
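To make the NER+L step concrete, here is a minimal sketch of the idea: find mentions of clinical terms in free text (recognition) and map each one to a terminology code (linkage). It is a toy dictionary lookup, not the open-source tooling the authors used. The vocabulary, the function names `annotate` and `count_concepts`, and the example sentences are all assumptions made for illustration, and the SNOMED codes shown should be checked against a current SNOMED CT release before any real use.

```python
import re
from collections import Counter

# Hypothetical mini-vocabulary: lower-cased surface form -> (code, preferred name).
# Two surface forms deliberately link to the same concept.
VOCAB = {
    "hypertension": ("38341003", "Hypertensive disorder"),
    "high blood pressure": ("38341003", "Hypertensive disorder"),
    "diabetes mellitus": ("73211009", "Diabetes mellitus"),
    "asthma": ("195967001", "Asthma"),
}

# Longest terms first so multi-word phrases win over shorter overlaps.
PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(term) for term in sorted(VOCAB, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def annotate(text):
    """Return one annotation per matched mention, with character offsets."""
    return [
        {
            "start": m.start(),
            "end": m.end(),
            "text": m.group(0),
            "code": VOCAB[m.group(0).lower()][0],
        }
        for m in PATTERN.finditer(text)
    ]


def count_concepts(documents):
    """Count linked concept codes across a collection of documents."""
    counts = Counter()
    for doc in documents:
        counts.update(a["code"] for a in annotate(doc))
    return counts


docs = [
    "History of high blood pressure and asthma.",
    "Hypertension well controlled. No diabetes mellitus.",
]
concept_counts = count_concepts(docs)
```

Note that this sketch counts "No diabetes mellitus" as a diabetes mention. Handling negation, spelling variation, and ambiguous abbreviations is precisely what a production NER+L system has to add on top of simple lookup.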

Challenges and Benefits of NLP in Clinical Data Analysis 

Clinical notes and letters remain the primary method for recording and sharing medical information between clinical staff. Free text is difficult to analyze computationally, but NLP combined with a clinical terminology such as SNOMED can automate a significant portion of the "structure and standardize" process, making the full clinical record accessible for computational analysis. The resulting analysis describes the scale and nature of the available data in detail and enables the detection of disease burden, onset, and co-occurrence patterns.

Patient Data and Disease Prevalence 

Data for 1.07 million patients were extracted, and hypertension was the most prevalent disorder in both sexes. The authors used NHS England business rules and NHS England Quality and Outcomes Framework (QOF) data to calculate prevalence rates for specific conditions. The analysis highlights differences between males and females in prevalence and, for some conditions, in age at first detection.
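As a rough illustration of the prevalence calculation, the sketch below computes the share of patients in each sex who have at least one annotation for a given concept. The record layout, the function name `prevalence_by_sex`, and the four example patients are invented for this example. It does not reproduce the NHS England business rules or the QOF definitions, which specify in far more detail which codes and patients count.

```python
from collections import defaultdict


def prevalence_by_sex(patients, code):
    """Fraction of patients per sex whose record contains `code`."""
    totals = defaultdict(int)
    cases = defaultdict(int)
    for patient in patients:
        totals[patient["sex"]] += 1
        if code in patient["codes"]:
            cases[patient["sex"]] += 1
    return {sex: cases[sex] / totals[sex] for sex in totals}


# Hypothetical patient-level summaries: the set of concept codes
# detected anywhere in each patient's documents.
patients = [
    {"id": 1, "sex": "F", "codes": {"38341003"}},
    {"id": 2, "sex": "F", "codes": set()},
    {"id": 3, "sex": "M", "codes": {"38341003", "73211009"}},
    {"id": 4, "sex": "M", "codes": {"38341003"}},
]

HYPERTENSION = "38341003"  # illustrative code; verify against SNOMED CT
rates = prevalence_by_sex(patients, HYPERTENSION)
```

Counting each patient once, rather than each mention, is the key design choice here: a condition mentioned in fifty letters for one patient should still contribute a single case.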

Identification of Patient Clusters and Limitations of NLP 

The study identified clusters of patients who shared common diagnoses, suggesting that NLP output could be used to identify cohorts of patients with similar disease patterns. It also discusses the strengths, limitations, and ethical considerations of using NLP to mine EHR data, and acknowledges that errors may still be introduced as clinical language and terminology change.
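One simple way to picture grouping patients by shared diagnoses is sketched below. It links any two patients whose diagnosis sets have a Jaccard similarity at or above a threshold, then treats each connected group as a cluster. This is an assumed, simplified technique chosen for illustration and is not claimed to be the clustering method used in the study. The patient identifiers, diagnosis labels, and the 0.5 threshold are all hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_patients(diagnoses, threshold=0.5):
    """Group patients into connected components of the similarity graph."""
    parent = {pid: pid for pid in diagnoses}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Link every pair of patients that is similar enough.
    for a, b in combinations(diagnoses, 2):
        if jaccard(diagnoses[a], diagnoses[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for pid in diagnoses:
        clusters.setdefault(find(pid), set()).add(pid)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical per-patient diagnosis sets derived from NLP annotations.
diagnoses = {
    "p1": {"hypertension", "diabetes"},
    "p2": {"hypertension", "diabetes", "chronic kidney disease"},
    "p3": {"asthma", "eczema"},
    "p4": {"asthma"},
}
clusters = cluster_patients(diagnoses)
```

The pairwise comparison is quadratic in the number of patients, so at the scale of a million patients an approach like this would need sampling or an approximate similarity search rather than exhaustive comparison.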

Conclusion and Future Implications 

Despite the study's limitations, such as only including patients from a single south London hospital, it demonstrates the potential of NLP to extract structured data from unstructured clinical notes, enabling large-scale analysis of multi-morbidity. This could aid in hypothesis generation, population health studies, and clinical research using real-world data.