A Spark job loads an e-commerce fact table (≈120 million rows, 50 columns) each day. Because the same vendor file can be delivered several times in one day, the raw table contains duplicate rows that are identical on every column except load_ts (the ingestion timestamp). The engineering team must build a cleansed DataFrame that
keeps only the earliest load_ts for each unique combination of order_id and line_id,
produces identical results on every rerun, regardless of partitioning, and
avoids the wide aggregations that can exhaust executor memory on very large tables. They must also save all removed rows to duplicates_audit for compliance review.
Which PySpark approach best satisfies these requirements?
Cache the raw DataFrame and run dropDuplicates(["order_id","line_id"]); use a subtract to obtain duplicates_audit.
Call distinct() on the entire raw DataFrame so that duplicates disappear automatically; the difference between the two DataFrames becomes duplicates_audit.
Generate rn = row_number().over(Window.partitionBy("order_id","line_id").orderBy("load_ts")), keep rows where rn == 1, then anti-join this result back to the raw DataFrame to create duplicates_audit.
Run groupBy("order_id","line_id").agg(min("load_ts").alias("load_ts")) and inner-join the result to the raw DataFrame; the non-matching rows form duplicates_audit.
A window specification that partitions by order_id and line_id and orders by load_ts deterministically retains exactly one record per business key. row_number() assigns a sequence within each key group, and filtering on row_number == 1 keeps the row with the earliest timestamp. An anti-join back to the raw DataFrame then isolates the discarded rows for duplicates_audit by finding the rows that are not present in the cleansed result.
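A minimal sketch of this pattern, assuming the raw table has already been loaded as a DataFrame named raw_df (a hypothetical name; the duplicates_audit table name comes from the scenario):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rank the copies of each business key, earliest load_ts first
key_window = Window.partitionBy("order_id", "line_id").orderBy("load_ts")

ranked = raw_df.withColumn("rn", F.row_number().over(key_window))

# Keep only the earliest row per (order_id, line_id)
cleansed = ranked.filter(F.col("rn") == 1).drop("rn")

# Rows of the raw table that did not survive are the discarded duplicates;
# because duplicates differ only in load_ts, joining on key + load_ts is enough
duplicates_audit = raw_df.join(
    cleansed,
    on=["order_id", "line_id", "load_ts"],
    how="left_anti",
)

duplicates_audit.write.mode("append").saveAsTable("duplicates_audit")
```

Because the window applies an explicit orderBy within each key, reruns keep the same survivor regardless of how the input happens to be partitioned.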
The dropDuplicates method keeps the first row it encounters for each key after the hash shuffle; which row survives is not deterministic, and for a very wide table the implicit first() aggregation it performs over every non-key column drives up executor memory use.
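For contrast, a sketch of the dropDuplicates-plus-subtract option (same hypothetical raw_df), with the non-determinism noted in comments:

```python
# Which row survives for each (order_id, line_id) depends on physical ordering
# after the shuffle, so it can change from run to run
deduped = raw_df.dropDuplicates(["order_id", "line_id"])

# subtract compares all 50 columns row by row; because the surviving row above
# is arbitrary, the audit set is also unstable across reruns
duplicates_audit = raw_df.subtract(deduped)
```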
groupBy("order_id", "line_id").agg(min("load_ts")) returns only the two key columns and the minimum timestamp, so an additional join back to the raw table is required to recover the other 47 columns, adding shuffle cost and complexity.
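A sketch of the groupBy-and-join option (again using the hypothetical raw_df) shows the extra joins needed to restore the dropped columns and build the audit set:

```python
from pyspark.sql import functions as F

# The aggregate keeps only the two keys and the minimum timestamp
earliest = raw_df.groupBy("order_id", "line_id").agg(
    F.min("load_ts").alias("load_ts")
)

# A second shuffle-heavy join is needed just to get the other 47 columns back
cleansed = raw_df.join(
    earliest, on=["order_id", "line_id", "load_ts"], how="inner"
)

# And yet another join (anti) to isolate the removed rows for the audit table
duplicates_audit = raw_df.join(
    earliest, on=["order_id", "line_id", "load_ts"], how="left_anti"
)
```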
distinct() would remove nothing, because the duplicate rows differ in load_ts and are therefore not exact duplicates across all 50 columns.
Therefore the window-based row_number solution is the only option that meets all functional and performance requirements.