Your team is preparing data for a churn-prediction model. A web-events table with 600 million rows must be linked to a 5 million-row CRM table so that behavioural features can be aggregated per customer. The only potentially shared attributes are (1) email_address, free text containing typos, mixed case and extra whitespace, and (2) phone, a 10- to 14-digit number with inconsistent punctuation and an optional country code. An exact inner join on both attributes retrieves only 72% of the expected matches. The business requires at least 95% linkage, and each linked pair must retain a confidence or similarity score for later audit. Memory is limited, so generating every possible record pair is not feasible.
Which data-wrangling approach best meets these requirements?
Lower-case and trim both attributes, hash each with SHA-256 to create a composite key, and join the two tables exactly on that hash value.
Standardise phone numbers, apply a phonetic or distance-based encoding (for example Soundex and Levenshtein) to the email local part, then perform a fuzzy join that outputs a similarity score column.
Remove all rows that have null values in either attribute and repeat an inner join on the cleaned columns without any further preprocessing.
One-hot encode the email domains, cluster the two tables with k-means, and cross-join records that fall into the same cluster to create candidate links.
Normalising the phone numbers (for example to E.164 or a digits-only format) removes formatting noise, and applying a phonetic or distance-based encoding such as Soundex plus Levenshtein distance to the email local part allows records that differ only by common typing errors to be compared. A fuzzy join that filters on a similarity threshold can then link the two tables while outputting the calculated score, achieving the required ≥95% match rate and an auditable match-quality column without materialising the full Cartesian product.

The other options fall short. Simply dropping nulls and re-running an exact join does not improve match coverage; hashing preserves only exact equality and destroys partial-similarity information; and clustering on one-hot-encoded domains followed by a cross-join is computationally heavy, does not directly yield pair-level confidence scores, and is unlikely to improve on deterministic linkage.
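The approach in the correct option can be sketched in plain Python. This is a minimal, self-contained illustration, not a production linkage pipeline: the Soundex and Levenshtein implementations are simplified textbook versions, and the phone-normalisation rule (keeping the trailing ten digits) is an illustrative assumption that suits NANP-style numbers rather than a universal standard. Blocking on the normalised phone and on the Soundex code of the email local part keeps the comparison space small instead of generating every record pair.

```python
def normalize_phone(raw: str) -> str:
    """Strip punctuation and an optional country code.
    Assumption for illustration: keep the trailing 10 digits."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    return digits[-10:]

def soundex(word: str) -> str:
    """Basic Soundex phonetic code (initial letter + three digits)."""
    word = "".join(c for c in word.upper() if c.isalpha())
    if not word:
        return ""
    codes = {}
    for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for c in group:
            codes[c] = digit
    out, prev = word[0], codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "HW":                # H/W do not reset the previous code
            continue
        code = codes.get(ch, "")      # vowels map to "" and reset prev
        if code and code != prev:
            out += code
        prev = code
    return (out + "000")[:4]

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalised similarity in [0, 1] derived from edit distance."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def email_local(email: str) -> str:
    return email.strip().lower().split("@")[0]

def fuzzy_join(web_events, crm, threshold=0.85):
    """Block CRM rows by normalised phone and by Soundex of the email
    local part, then score candidates within each block only, avoiding
    the full Cartesian product. Emits (web_row, crm_row, score)."""
    blocks = {}
    for row in crm:
        blocks.setdefault(normalize_phone(row["phone"]), []).append(row)
        blocks.setdefault(soundex(email_local(row["email"])), []).append(row)
    links = []
    for ev in web_events:
        local = email_local(ev["email"])
        candidates = (blocks.get(normalize_phone(ev["phone"]), []) +
                      blocks.get(soundex(local), []))
        for cand in {id(c): c for c in candidates}.values():   # dedupe
            score = similarity(local, email_local(cand["email"]))
            if score >= threshold:
                links.append((ev, cand, round(score, 3)))
    return links
```

For example, a web-events row with email "Jon.Smith@example.com" and phone "(415) 555-0123" lands in the same blocks as a CRM row holding "john.smith@example.com" and "+1 415-555-0123", and the pair is emitted with a similarity score (0.9 here) that can be kept as the audit column the question requires.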