CompTIA DataX DY0-001 (V1) Practice Question

A Spark job loads an e-commerce fact table (≈120 million rows, 50 columns) each day. Because the same vendor file can be delivered several times in one day, the raw table contains duplicate rows that are identical on every column except load_ts (the ingestion timestamp).
The engineering team must build a cleansed DataFrame that

  • keeps only the earliest load_ts for each unique combination of order_id and line_id,
  • produces identical results on every rerun, regardless of partitioning, and
  • avoids the wide aggregations that can exhaust executor memory on very large tables.
The team must also save all removed rows to duplicates_audit for compliance review.

Which PySpark approach best satisfies these requirements?

  • Run groupBy("order_id","line_id").agg(min("load_ts").alias("load_ts")) and inner-join the result to the raw DataFrame; the non-matching rows form duplicates_audit.

  • Call distinct() on the entire raw DataFrame so that duplicates disappear automatically; the difference between the two DataFrames becomes duplicates_audit.

  • Generate rn = row_number().over(Window.partitionBy("order_id","line_id").orderBy("load_ts")), keep rows where rn == 1, then anti-join this result back to the raw DataFrame to create duplicates_audit.

  • Cache the raw DataFrame and run dropDuplicates(["order_id","line_id"]); use a subtract to obtain duplicates_audit.
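
The requirements in the question can be written down as a small reference check, which is handy for comparing any candidate approach on a tiny sample before running it on the cluster. The sketch below is a plain-Python model, not PySpark. The helper name split_earliest and the sample rows (including the qty column) are hypothetical. It assumes, as the question states, that duplicates always differ in load_ts, and that load_ts is an ISO-formatted string that sorts chronologically.

```python
from collections import defaultdict


def split_earliest(rows):
    """Split rows into (cleansed, duplicates_audit).

    cleansed keeps the row with the earliest load_ts for each
    (order_id, line_id); every other row goes to duplicates_audit.
    Assumes no two rows of the same key share a load_ts.
    """
    # Group rows by the business key.
    groups = defaultdict(list)
    for row in rows:
        groups[(row["order_id"], row["line_id"])].append(row)

    cleansed, duplicates_audit = [], []
    # Iterate keys in sorted order so the output does not depend on
    # the order in which the input rows arrived.
    for key in sorted(groups):
        ordered = sorted(groups[key], key=lambda r: r["load_ts"])
        cleansed.append(ordered[0])          # earliest load_ts is kept
        duplicates_audit.extend(ordered[1:])  # later deliveries are audited
    return cleansed, duplicates_audit


# Hypothetical sample: order 1 / line 1 was delivered three times.
sample = [
    {"order_id": 1, "line_id": 1, "load_ts": "2024-01-01 09:00:00", "qty": 2},
    {"order_id": 1, "line_id": 1, "load_ts": "2024-01-01 13:30:00", "qty": 2},
    {"order_id": 1, "line_id": 2, "load_ts": "2024-01-01 09:00:00", "qty": 5},
    {"order_id": 1, "line_id": 1, "load_ts": "2024-01-01 11:15:00", "qty": 2},
    {"order_id": 2, "line_id": 1, "load_ts": "2024-01-01 10:00:00", "qty": 1},
]

cleansed, duplicates_audit = split_earliest(sample)
```

Two properties are worth checking against any real implementation: the cleansed and audit row counts add up to the raw row count, and feeding the same rows in a different order gives the same two outputs.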
