A U.S. health insurer is building a machine-learning model that predicts whether an adult member with diabetes will visit the emergency department (ED) in the next 30 days. The current feature set consists only of de-identified medical and pharmacy claims, and internal evaluation shows that performance has plateaued. The chief data scientist asks the team to add one external data source that will 1) cover nearly every member, 2) be inexpensive and publicly obtainable, 3) minimize additional protected-health-information (PHI) risk, and 4) have a known association with acute ED utilization in diabetes. Four candidate data feeds are available. Which data source best meets all of these requirements and is therefore most likely to improve the model's predictive power?
Continuous glucose monitor (CGM) readings collected through Dexcom's API for members who consent and use a compatible device.
CDC/ATSDR Social Vulnerability Index scores combined with American Community Survey socioeconomic indicators linked to each member's home ZIP Code.
Hourly temperature and humidity data from NOAA's Integrated Surface Database matched to the service date and location of each claim.
Daily step-count and sleep metrics exported from Fitbit wearables for members who opt in through the insurer's wellness portal.
Area-level socioeconomic context captured in the CDC/ATSDR Social Vulnerability Index (SVI) and related American Community Survey variables can be deterministically joined to every member by ZIP Code or census tract at negligible cost. Numerous studies associate higher SVI scores with preventable or unplanned ED use among people with diabetes, so these features carry predictive signal. Because the data are aggregated at the geographic level rather than collected per person, they add little incremental PHI risk beyond the existing de-identified claims data.
Hourly weather observations from NOAA are public and low-cost, but their link to short-term ED visits for diabetes is weaker and less direct. Fitbit and CGM streams can be highly informative for individual glycemic control, yet they apply only to the minority of members who own and authorize those devices, require commercial agreements, and introduce substantial PHI and integration overhead. Therefore, the SVI/ACS source is the only option that satisfies all four criteria and is most likely to produce a systematic lift in model performance.
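The deterministic ZIP Code join described in the explanation can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Member` class, the `attach_svi` and `coverage` helpers, and the lookup values are hypothetical and are not real SVI scores or the insurer's schema. The sketch only shows the shape of a left join from members to an area-level table, plus a check of the "cover nearly every member" requirement.

```python
from dataclasses import dataclass

# Hypothetical lookup: 5-digit ZIP Code -> overall SVI percentile (0 to 1).
# The values are invented for illustration and are NOT real SVI scores.
SVI_BY_ZIP = {
    "30303": 0.91,
    "60614": 0.12,
    "73301": 0.55,
}


@dataclass
class Member:
    member_id: str
    zip_code: str


def normalize_zip(raw: str) -> str:
    """Drop any ZIP+4 suffix and restore leading zeros lost in storage."""
    return raw.strip().split("-")[0].zfill(5)


def attach_svi(members, svi_by_zip):
    """Left join: every member is kept; unmatched ZIP Codes get svi=None."""
    rows = []
    for m in members:
        z = normalize_zip(m.zip_code)
        rows.append(
            {"member_id": m.member_id, "zip_code": z, "svi": svi_by_zip.get(z)}
        )
    return rows


def coverage(rows) -> float:
    """Share of members that received an area-level score."""
    if not rows:
        return 0.0
    return sum(r["svi"] is not None for r in rows) / len(rows)


members = [
    Member("A1", "30303"),
    Member("A2", "60614-1234"),  # ZIP+4 format
    Member("A3", "2139"),        # leading zero dropped; not in the lookup
    Member("A4", "73301"),
]

features = attach_svi(members, SVI_BY_ZIP)
print(f"coverage: {coverage(features):.2f}")
```

Because the join key is a geographic code already present in enrollment data, no new person-level attribute is collected. Members whose ZIP Code is missing from the lookup remain in the output with an empty score, so the coverage shortfall can be measured directly instead of silently dropping rows.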