Your team is preparing data for a churn-prediction model. A web-events table with 600 million rows must be linked to a 5 million-row CRM table so that behavioural features can be aggregated per customer. The only potentially shared attributes are (1) email_address, free text containing typos, mixed case and extra whitespace, and (2) phone, a 10- to 14-digit number with inconsistent punctuation and an optional country code. An exact inner join on both attributes retrieves only 72% of the expected matches. The business requires at least 95% linkage, and each linked pair must retain a confidence or similarity score for later audit. Memory is limited, so generating every possible record pair is not feasible.
Which data-wrangling approach best meets these requirements?
Lower-case and trim both attributes, hash each with SHA-256 to create a composite key, and join the two tables exactly on that hash value.
Standardise phone numbers, apply a phonetic or distance-based encoding (for example Soundex and Levenshtein) to the email local part, then perform a fuzzy join that outputs a similarity score column.
Remove all rows that have null values in either attribute and repeat an inner join on the cleaned columns without any further preprocessing.
One-hot encode the email domains, cluster the two tables with k-means, and cross-join records that fall into the same cluster to create candidate links.
Normalising the phone numbers (for example to E.164 or a digits-only format) removes formatting noise, and applying a phonetic or distance-based encoding such as Soundex plus Levenshtein distance to the email local part allows records that differ only by common typing errors to be compared. A fuzzy join that filters on a similarity threshold can then link the two tables while outputting the calculated score, achieving the required ≥95% match rate and an auditable match-quality column without materialising the full Cartesian product.

The other options fall short. Simply dropping nulls and re-running an exact join does not improve match coverage; hashing preserves only exact equality and destroys partial-similarity information; and clustering on one-hot-encoded domains followed by a cross-join is computationally heavy, does not directly yield pair-level confidence scores, and is unlikely to improve on deterministic linkage.
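The approach in the correct option can be sketched in plain Python. This is a minimal, self-contained illustration, not a production linkage pipeline: the Soundex and Levenshtein implementations are simplified textbook versions, and the phone-normalisation rule (keeping the trailing ten digits) is an illustrative assumption that suits NANP-style numbers rather than a universal standard. Blocking on the normalised phone and on the Soundex code of the email local part keeps the comparison space small instead of generating every record pair.

```python
def normalize_phone(raw: str) -> str:
    """Strip punctuation and an optional country code.
    Assumption for illustration: keep the trailing 10 digits."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    return digits[-10:]

def soundex(word: str) -> str:
    """Basic Soundex phonetic code (initial letter + three digits)."""
    word = "".join(c for c in word.upper() if c.isalpha())
    if not word:
        return ""
    codes = {}
    for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for c in group:
            codes[c] = digit
    out, prev = word[0], codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "HW":                # H/W do not reset the previous code
            continue
        code = codes.get(ch, "")      # vowels map to "" and reset prev
        if code and code != prev:
            out += code
        prev = code
    return (out + "000")[:4]

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalised similarity in [0, 1] derived from edit distance."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def email_local(email: str) -> str:
    return email.strip().lower().split("@")[0]

def fuzzy_join(web_events, crm, threshold=0.85):
    """Block CRM rows by normalised phone and by Soundex of the email
    local part, then score candidates within each block only, avoiding
    the full Cartesian product. Emits (web_row, crm_row, score)."""
    blocks = {}
    for row in crm:
        blocks.setdefault(normalize_phone(row["phone"]), []).append(row)
        blocks.setdefault(soundex(email_local(row["email"])), []).append(row)
    links = []
    for ev in web_events:
        local = email_local(ev["email"])
        candidates = (blocks.get(normalize_phone(ev["phone"]), []) +
                      blocks.get(soundex(local), []))
        for cand in {id(c): c for c in candidates}.values():   # dedupe
            score = similarity(local, email_local(cand["email"]))
            if score >= threshold:
                links.append((ev, cand, round(score, 3)))
    return links
```

For example, a web-events row with email "Jon.Smith@example.com" and phone "(415) 555-0123" lands in the same blocks as a CRM row holding "john.smith@example.com" and "+1 415-555-0123", and the pair is emitted with a similarity score (0.9 here) that can be kept as the audit column the question requires.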