A data scientist is tasked with creating a unified customer view by merging two datasets:
The transactions table contains transaction_id and customer_email.
The profiles table contains profile_id (a surrogate primary key), full_name, and email.
The profile_id does not exist in the transactions table. A preliminary analysis shows that the email fields in both tables contain formatting inconsistencies, typos, and a significant number of null values, making them unreliable as a sole identifier. Given this scenario, what is the most robust strategy for defining a key to merge these two tables?
Generate a new surrogate key using a hash function on the transaction_id in the transactions table and the profile_id in the profiles table.
Perform a cross join (Cartesian product) between the two tables and then filter the results where the email fields are an exact match.
Use the email field as a natural key for the join after filtering out all records where the email is null from both datasets.
Create a composite key for each dataset by first standardizing and then combining the email and full_name fields before performing the join.
The correct answer is to create a composite key. Because a single field (like email) is unreliable due to nulls and errors, a composite key that combines multiple standardized fields (email and full_name) provides a more resilient identifier for merging. Standardizing first (for example, trimming whitespace and lower-casing) removes formatting differences that would otherwise block a match, and combining fields uses more of the available information, so a single faulty data point is less likely to cause a match failure.
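A minimal sketch of the standardize-then-combine idea, using only the Python standard library. The sample rows are invented for illustration. The scenario lists only transaction_id and customer_email for the transactions table, so the sketch assumes a hypothetical customer_name column is also available there. The helper names standardize and composite_key are likewise assumptions, not part of the question.

```python
import re

def standardize(value):
    """Trim, lower-case, and collapse whitespace; missing values become ''."""
    if value is None:
        return ""
    return re.sub(r"\s+", " ", value.strip().lower())

def composite_key(email, name):
    """Combine the standardized email and name into one tuple key."""
    return (standardize(email), standardize(name))

# Hypothetical sample rows; customer_name on transactions is an assumption.
transactions = [
    {"transaction_id": 1, "customer_email": " Ann@Example.COM ", "customer_name": "Ann  Lee"},
    {"transaction_id": 2, "customer_email": "bob@example.com", "customer_name": "Bob Roy"},
]
profiles = [
    {"profile_id": 10, "full_name": "ann lee", "email": "ann@example.com"},
    {"profile_id": 11, "full_name": "Cara Diaz", "email": "cara@example.com"},
]

# Index profiles by composite key, then look up each transaction's key.
profile_index = {
    composite_key(p["email"], p["full_name"]): p["profile_id"] for p in profiles
}
matches = [
    (
        t["transaction_id"],
        profile_index.get(composite_key(t["customer_email"], t["customer_name"])),
    )
    for t in transactions
]
# matches pairs each transaction_id with a profile_id, or None when no key matches.
```

Here transaction 1 links to profile 10 even though the raw email and name differ in case and spacing, while transaction 2 has no counterpart. Typos would still need fuzzy matching, which this sketch does not attempt.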
Using only the email field after filtering nulls is not ideal because it discards potentially linkable records where the name might match even if the email is missing.
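A cross join is computationally expensive and impractical: it creates a Cartesian product of all rows from both tables, so the intermediate result grows with the product of the two row counts. Filtering that result on an exact email match would also still miss records whose emails differ only in formatting or contain typos.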
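To make the cost concrete, the following sketch computes the size of the Cartesian product for assumed table sizes and shows why an exact-match filter still fails on unstandardized emails. The row counts and sample strings are hypothetical, not figures from the scenario.

```python
def cross_join_rows(n_transactions, n_profiles):
    """Number of rows a cross join produces before any filtering."""
    return n_transactions * n_profiles

# Hypothetical table sizes, chosen only to show the scale of the product.
intermediate_rows = cross_join_rows(1_000_000, 100_000)

# An exact-match filter treats these as different customers
# because the raw strings differ in case and surrounding spaces.
raw_transaction_email = " Ann@Example.COM "
raw_profile_email = "ann@example.com"
exact_match = raw_transaction_email == raw_profile_email
```

With these assumed sizes the join would materialize 100 billion candidate pairs, and the pair above would still be filtered out.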
Generating a new surrogate key in one table does not solve the core problem of finding a corresponding record in the other table, as there is no basis for the match.
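The point about hashed surrogate keys can be sketched as follows. The ID values are invented, and SHA-256 is chosen only for illustration since the question does not name a hash function.

```python
import hashlib

def surrogate_key(record_id):
    """Hash an integer ID into a hex digest (SHA-256 is an illustrative choice)."""
    return hashlib.sha256(str(record_id).encode("utf-8")).hexdigest()

# Hypothetical ID values; the two ID sequences are generated independently.
transaction_keys = {surrogate_key(t) for t in (1, 2, 3)}
profile_keys = {surrogate_key(p) for p in (10, 11, 12)}

# No key is shared, so a join on these hashes returns no rows.
shared_keys = transaction_keys & profile_keys
```

Even if a transaction_id and a profile_id happened to share a value, the identical hashes would be a coincidence and would not indicate that the rows describe the same customer.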