Linked health data: Dealing with demographic data discrepancies
Abstract
Background: Data linkage custodians tolerate error rates of ≤0.5%, which can still represent tens of thousands of records in large datasets. Demographic variables are important covariates in digital health studies, and discrepancies in them can influence analysis. Simply discarding cases with errors is inadequate: errors may not arise from the linkage process but may be inherent in the data prior to linkage, so their frequency may exceed the linkage error rate. Moreover, removing cases from multiple datasets compounds the impact.
Aims: Develop an approach to detecting, classifying, and correcting demographic data discrepancies in large linked datasets.
Methods: Using the demographic variables sex, birth date, and death date from two data linkage studies (ASHLi and PAVLOVA) as examples, we present methods for defining within- and between-dataset discrepancies in single-record-per-case and multiple-record-per-case datasets. We illustrate how handling categorical data (e.g., sex) differs from handling date data. Methods are presented for:
- resolving partially missing data
- choosing the most common option from a value set
- resolving date boundary errors
- using surrounding date information to resolve date conflicts
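As an illustration only, two of the listed strategies can be sketched in Python (this is not the SAS macro presented in the study). The function names, the tie-handling rule, and the use of a last recorded contact date as the "surrounding date information" are assumptions made for this sketch.

```python
from collections import Counter
from datetime import date


def resolve_categorical(values):
    """Return the most common non-missing value, or None if undecidable.

    Hypothetical helper: missing values are represented as None, and a tie
    between the two most frequent values is left unresolved (None).
    """
    observed = [v for v in values if v is not None]
    if not observed:
        return None  # nothing recorded for this case
    counts = Counter(observed).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: flag for manual review rather than guess
    return counts[0][0]


def resolve_death_date(candidates, last_contact):
    """Pick a death date that is consistent with surrounding dates.

    Assumption for this sketch: a death date earlier than the last recorded
    contact (e.g., a later admission) is implausible and is dropped before
    the most common remaining candidate is chosen.
    """
    plausible = [d for d in candidates if d is not None and d >= last_contact]
    return resolve_categorical(plausible)


# Sex recorded four times for one case across linked datasets
print(resolve_categorical(["F", "F", "M", None]))

# Two conflicting death dates; the case was still in contact in mid-2011
print(resolve_death_date([date(2010, 1, 1), date(2012, 5, 3)], date(2011, 6, 1)))
```

Returning None for ties and for cases with no plausible value keeps unresolved discrepancies visible, so they can be reviewed or excluded explicitly rather than silently corrected.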
Results: Our methods minimise the impact of data errors by retaining a substantial proportion of cases that would otherwise be discarded; the proportion retained depends on the nature and number of the variables. A SAS macro that automates the work is presented.
Conclusions: While these strategies are not applicable to all demographic variables, our work shows that researchers can improve the quality of their data without having to discard all discrepant cases. Automating these processes will make the strategies more accessible to researchers.