Start: February 2023
End: August 2024
Funding: Dutch Research Council (NWA.1418.22.008)
Electronic health records (EHRs), which contain valuable information about clinical practice, are increasingly available. With recent progress in artificial intelligence and natural language processing, a rapidly growing number of medical prediction studies use text mining tools to automatically collect study variables and outcomes from free text. Examples include the prediction of text-extracted fall events in older adults and of text-extracted side effects in patients using antipsychotic medication. However, the potential impact of text mining errors on the results of such studies has received insufficient attention, and preconditions for the responsible use of free text are lacking. These preconditions concern, for example, minimum text mining quality and reporting standards, as well as interpretation pitfalls, such as the fact that absence of information in textual notes is generally not evidence of absence. This project will synthesize knowledge from three fields of research: (1) medical prediction research, (2) natural language processing, and (3) statistical measurement error, aiming to:
Systematically study in what ways erroneous text mining models may induce bias in subsequent medical prediction studies.
Arrive at a set of preconditions and recommendations for responsible conduct, reporting, and interpretation of prediction research using variables automatically collected from free text.
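As a rough illustration of the first aim, the sketch below shows one well-known way in which an imperfect text mining tool can bias a downstream result: misclassification of a binary outcome that is unrelated to exposure pulls an observed risk ratio towards 1. All numbers (risks, sensitivity, specificity) and the function name `observed_risk` are hypothetical and chosen for illustration only; this is not the project's method, and real text mining errors may well depend on exposure or on other patient characteristics.

```python
def observed_risk(true_risk: float, sensitivity: float, specificity: float) -> float:
    """Expected proportion of patients flagged as having the outcome
    by a text mining tool with the given sensitivity and specificity."""
    true_positives = sensitivity * true_risk
    false_positives = (1 - specificity) * (1 - true_risk)
    return true_positives + false_positives


# Hypothetical true outcome risks in exposed and unexposed patients.
risk_exposed = 0.30
risk_unexposed = 0.10

# Hypothetical tool performance, assumed identical in both groups
# (non-differential misclassification).
sensitivity = 0.80
specificity = 0.95

obs_exposed = observed_risk(risk_exposed, sensitivity, specificity)
obs_unexposed = observed_risk(risk_unexposed, sensitivity, specificity)

true_rr = risk_exposed / risk_unexposed
observed_rr = obs_exposed / obs_unexposed

print(f"True risk ratio:     {true_rr:.2f}")
print(f"Observed risk ratio: {observed_rr:.2f}")
```

With these assumed values the observed risk ratio is smaller than the true one, even though the tool performs equally well in both groups. Errors that differ between groups can bias results in either direction.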
Objective: To investigate how NLP models can be used to extract study variables (specifically exposures) so that exposure-outcome association studies can be conducted reliably.
Objective: To investigate how the performance of text mining models that extract health outcomes relates to the performance of a downstream prognostic prediction model developed on the text-mined outcome data.
Objective: To assess the completeness of recording of relevant signs, symptoms, and measurements in Dutch free text fields of primary care electronic health records of adults with lower respiratory tract infections.