Your team is merging two patient-registration systems for a nationwide health provider. System A contains 1.2 million records and System B contains 900 000. Each record stores Social Security Number (SSN), first name, last name, date of birth, and ZIP code, but about 15 percent of the SSNs in System B are missing. A deterministic inner join on SSN has matched 70 percent of System B's records. The business goal is to increase the overall match rate while keeping the false-match rate below 1 percent and avoiding a full Cartesian comparison of the two files. Which data-matching strategy is most appropriate?
Accept only the SSN matches and treat all unmatched rows as new patients to avoid introducing false links.
Create Soundex codes for first and last names and run a fuzzy join on every remaining record pair without using any blocking strategy.
Perform a second deterministic join that keeps the SSN match and additionally links any records whose first three name characters match exactly.
Use a probabilistic Fellegi-Sunter linkage that first blocks on last name and birth year, then applies Jaro-Winkler similarity on names and exact or near-exact comparisons on date of birth and ZIP code to compute match weights.
A probabilistic Fellegi-Sunter linkage can keep the false-match rate low while recovering matches that a deterministic SSN join loses to missing or slightly inconsistent identifiers. Blocking on stable, high-cardinality fields such as last name and birth year limits the number of candidate pairs, so the two files never need a full Cartesian comparison (1.2 million × 900,000 is roughly 1.08 trillion pairs). Within each block, field-specific comparison functions (for example, Jaro-Winkler similarity for names and exact or near-exact checks on date of birth and ZIP code) determine whether each field agrees, and each agreement or disagreement contributes a weight. The weights are summed into a match score, and the acceptance threshold on that score can be tuned to keep false matches under 1 percent. The other options fall short: accepting only SSN matches gives up on improving the match rate; a fuzzy Soundex join without blocking compares every remaining pair, which is computationally excessive and produces more false positives; and a deterministic rule on the first three name characters is too coarse to control false links and still misses records whose names differ in those characters.
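The scoring described in this explanation can be illustrated with a minimal Python sketch. It is not a production linkage pipeline: the field names (`id`, `first`, `last`, `dob`, `zip`), the m/u probabilities, the 0.90 Jaro-Winkler cutoff, and the threshold of 10 are all hypothetical values chosen for illustration. In practice the m/u probabilities would be estimated from the data and the threshold tuned against a reviewed sample. Because last name is part of the blocking key, every pair in a block already agrees on it, so the sketch scores only first name, date of birth, and ZIP code.

```python
from collections import defaultdict
from math import log2

def jaro(s1, s2):
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    hit1, hit2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not hit2[j] and s2[j] == ch:
                hit1[i] = hit2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len(s1)):
        if hit1[i]:
            while not hit2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

# Hypothetical (m, u) pairs: m = P(field agrees | true match),
# u = P(field agrees | non-match). Real values must be estimated.
FIELDS = {"first": (0.95, 0.01), "dob": (0.97, 0.005), "zip": (0.90, 0.02)}

def block_key(rec):
    # Blocking key: normalized last name plus birth year (dob is "YYYY-MM-DD").
    return (rec["last"].strip().upper(), rec["dob"][:4])

def agrees(field, a, b):
    if field == "first":  # assumed cutoff of 0.90 for name agreement
        return jaro_winkler(a.upper(), b.upper()) >= 0.90
    return a == b

def match_weight(ra, rb):
    """Sum of log2 agreement/disagreement weights over the compared fields."""
    total = 0.0
    for field, (m, u) in FIELDS.items():
        if agrees(field, ra[field], rb[field]):
            total += log2(m / u)
        else:
            total += log2((1 - m) / (1 - u))
    return total

def link(file_a, file_b, threshold):
    """Compare only records that share a block; keep pairs at or above threshold."""
    blocks = defaultdict(list)
    for ra in file_a:
        blocks[block_key(ra)].append(ra)
    links = []
    for rb in file_b:
        for ra in blocks.get(block_key(rb), []):
            w = match_weight(ra, rb)
            if w >= threshold:
                links.append((ra["id"], rb["id"], round(w, 2)))
    return links

# Made-up sample records, for illustration only.
system_a = [
    {"id": "A1", "first": "Jonathan", "last": "Smith", "dob": "1980-04-12", "zip": "30301"},
    {"id": "A2", "first": "Maria", "last": "Smith", "dob": "1980-09-01", "zip": "30301"},
]
system_b = [
    {"id": "B1", "first": "Jonathon", "last": "smith", "dob": "1980-04-12", "zip": "30301"},
    {"id": "B2", "first": "Mark", "last": "Smith", "dob": "1980-02-02", "zip": "10001"},
]
links = link(system_a, system_b, threshold=10.0)
print(links)
```

In this toy data all four records fall into the same block, yet only the pair whose first name, date of birth, and ZIP code all agree scores above the threshold. Raising the threshold trades match rate for a lower false-match rate, which is the lever the question's 1 percent constraint refers to.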