A public health agency is conducting a longitudinal study on the impact of a new manufacturing facility on community respiratory health over a 15-year period. The data science team is using administrative data from local clinics, which consists of patient records, diagnostic codes, and dates of service. Which of the following represents the most significant analytical challenge inherent to using this type of data for this specific study?
The high financial cost of licensing and integrating patient data from numerous independent healthcare providers.
Selection bias resulting from the fact that the dataset only includes individuals from the community who have sought medical care.
Systematic shifts in data attributes resulting from changes in diagnostic criteria and data collection protocols over the 15-year period.
The procedural overhead of anonymizing personally identifiable information (PII) to comply with healthcare data regulations.
The correct answer identifies that systematic shifts in data definitions over a long period are a primary analytical challenge. Administrative data, such as medical records, are subject to changes in how information is recorded. For a 15-year longitudinal study, it is highly likely that diagnostic coding systems (e.g., the transition from ICD-9 to ICD-10), data entry software, and internal collection protocols have changed. These changes can create artificial trends or mask real ones, directly threatening the internal validity of the study's conclusions.
Selection bias is a valid and significant limitation, as the data only represents those who seek care, affecting the generalizability of the findings. However, for a longitudinal analysis, a changing measurement system presents a more fundamental analytical challenge to identifying trends over time.
The procedural overhead of anonymizing PII is a critical compliance and ethical step but is not an analytical challenge that impacts the statistical validity of the findings.
High financial cost is incorrect because administrative data is often chosen specifically because it is more cost-effective than generating new data through surveys or experiments.
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What are systematic shifts in data attributes, and how do they impact analysis?
Open an interactive chat with Bash
How does the transition from ICD-9 to ICD-10 affect healthcare studies?
Open an interactive chat with Bash
What methods can data scientists use to address changes in data collection protocols?