Microsoft Fabric Data Engineer Associate DP-700 Practice Question
You manage a Microsoft Fabric lakehouse that contains a Delta table named Transactions with 500 million rows added each month. For reporting, you must create a nightly process that produces a summary table with total sales amount and order count for every customer and month. The solution must use PySpark, minimize shuffle-related memory usage, and store the result as a managed Delta table for downstream queries. Which approach meets the requirements?
Call df.groupBy("customerId").pivot("sale_month").sum("amount").write.format("delta").mode("overwrite").saveAsTable("SalesMonthly") without partitioning.
Convert the table to an RDD, use map and reduceByKey to calculate sums and counts, then save the results as a single CSV file.
Use df.groupBy("customerId", "sale_month").agg(sum_("amount").alias("total_amount"), count("*").alias("order_count")).write.format("delta").partitionBy("customerId", "sale_month").mode("overwrite").saveAsTable("SalesMonthly").
Run a Spark SQL statement that uses GROUPING SETS to aggregate the data, then write the output in Parquet format without partitions.
Using DataFrame-level groupBy and agg lets Spark apply Catalyst query optimization and Tungsten's whole-stage code generation, including partial (map-side) aggregation that reduces the volume of data shuffled. Writing the result in Delta format preserves ACID guarantees and efficient reads, and partitioning the output on customerId and sale_month co-locates related rows in the same files, which improves partition pruning and performance for downstream queries. Converting to an RDD and using map/reduceByKey bypasses Catalyst and Tungsten, increasing code complexity and memory pressure, and a single CSV file loses schema information and read parallelism. Pivoting on sale_month produces very wide rows that do not match the required reporting shape and can explode memory usage. GROUPING SETS in Spark SQL is valid for the aggregation itself, but writing plain Parquet without partitions gives up Delta's ACID update semantics and the pruning benefits of a partitioned layout.
Exam objective: Ingest and transform data