Microsoft Fabric Data Engineer Associate Practice Test (DP-700)
Use the form below to configure your Microsoft Fabric Data Engineer Associate Practice Test (DP-700). The practice test can be configured to include only certain exam objectives and domains. You can choose between 5 and 100 questions and set a time limit.

Microsoft Fabric Data Engineer Associate DP-700 Information
The Microsoft Fabric Data Engineer Associate (DP-700) exam shows that you know how to work with data in Microsoft Fabric. It tests your ability to collect, organize, and prepare data so it can be used for reports and dashboards. Passing the DP-700 means you can build and manage data pipelines, work with Fabric tools such as lakehouses, warehouses, and notebooks alongside Power BI, and make sure data is clean and ready for analysis.
This exam is best for people who already have some experience working with data or databases and want to move into a data engineering role. If you enjoy working with numbers, building reports, or using SQL and Python to manage data, this certification can help you stand out to employers. It’s designed for anyone who wants to show their skills in data handling using Microsoft tools.
Before taking the real exam, it’s smart to use DP-700 practice exams, practice tests, and practice questions to prepare. These tools help you get used to the types of questions you’ll see on test day and show which topics you need to study more. By using practice tests often, you can build confidence, improve your score, and walk into the exam knowing what to expect.

Free Microsoft Fabric Data Engineer Associate DP-700 Practice Test
- 20 Questions
- Unlimited
- Implement and manage an analytics solution
- Ingest and transform data
- Monitor and optimize an analytics solution
In a Microsoft Fabric notebook, you are building a PySpark Structured Streaming query that must read real-time clickstream events from Azure Event Hubs and land them in a lakehouse so that Power BI Direct Lake reports can query the data immediately. The solution must guarantee exactly-once processing after restarts and minimize small-file creation. Which writeStream configuration should you use?
.writeStream.format("delta").outputMode("append").option("checkpointLocation","/chkpt").start("Tables.clicks")
.writeStream.format("delta").outputMode("complete").option("checkpointLocation","/chkpt").start("/lakehouse/tables/clicks")
.writeStream.format("delta").outputMode("append").option("checkpointLocation","/chkpt").option("triggerOnce","true").start("Tables.clicks")
.writeStream.format("parquet").outputMode("append").option("checkpointLocation","/chkpt").start("/lakehouse/files/clicks")
Answer Description
Using the Delta format provides ACID transactions and exactly-once guarantees when a checkpoint location is specified. Writing directly to a managed Delta table in the lakehouse (Tables.clicks) makes the data immediately queryable through Direct Lake for Power BI. Append mode is the recommended setting for streaming inserts. Parquet lacks transactional guarantees, complete mode rewrites the entire table on every micro-batch, and triggerOnce turns the job into a one-off batch rather than a continuously running stream.
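
For reference, here is a minimal PySpark sketch of the winning pattern. A rate source stands in for the Event Hubs connector, and the table and checkpoint paths are illustrative assumptions, not the exam's exact values:

# Minimal sketch: stream into a managed lakehouse Delta table with exactly-once
# recovery. The rate source below is a stand-in for the Event Hubs reader, and
# the table/checkpoint names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

events = (
    spark.readStream
    .format("rate")                     # stand-in source; replace with the Event Hubs connector
    .option("rowsPerSecond", 10)
    .load()
)

query = (
    events.writeStream
    .format("delta")                                            # Delta = ACID + exactly-once with a checkpoint
    .outputMode("append")                                       # append-only inserts, no full-table rewrites
    .option("checkpointLocation", "Files/checkpoints/clicks")   # persisted offsets and state for recovery
    .toTable("clicks")                                          # managed table, visible to the SQL endpoint and Direct Lake
)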
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
Why is the Delta format preferred over Parquet in this scenario?
What does the checkpointLocation option do in PySpark Structured Streaming?
How does append output mode work in this context?
You are a Contributor in a Microsoft Fabric workspace that has item-level security enabled. You need to ensure that only the security group named DataScienceTeam can view and use a specific lakehouse, while other workspace items remain accessible to all workspace Members. Which action should you take to meet this requirement?
Open the lakehouse's Manage permissions pane yourself, remove inherited permissions, and assign the Viewer role to the DataScienceTeam group.
Apply a Confidential sensitivity label to the lakehouse and rely on Microsoft Purview to prevent unauthorized users from opening it.
Change the workspace access mode from Public to Private so that only explicitly added users, including DataScienceTeam, can access any item in the workspace.
Request a workspace Owner to break permission inheritance for the lakehouse, remove other users, and assign the Viewer role to the DataScienceTeam group.
Answer Description
Contributors cannot override inherited permissions for an individual lakehouse. To restrict access, you must ask a workspace Owner or Admin to open the lakehouse's Manage permissions pane, disable inherited permissions, remove other groups and users, and then grant the Viewer role only to the DataScienceTeam group. Changing the workspace access mode to Private would hide every item in the workspace, and applying a sensitivity label does not control access to the lakehouse object.
Ask Bash
What is item-level security in Microsoft Fabric?
What roles do Contributors and workspace Owners play in Microsoft Fabric?
How does breaking permission inheritance work in Microsoft Fabric?
You manage a Microsoft Fabric workspace that contains a lakehouse named ContosoSales. Historical sales files are stored in an Azure Data Lake Storage Gen2 (ADLS Gen2) account that is not linked to Fabric. You must expose the folder /raw/sales/ in ADLS Gen2 to users of ContosoSales so that they can query the data from SQL analytics endpoints without copying the files into OneLake. Security administrators insist that:
- Access to the ADLS Gen2 account continues to be controlled with Azure role-based access control (Azure RBAC).
- Finance analysts who already have Storage Blob Data Reader rights on the ADLS Gen2 container must automatically be able to query the data from Fabric.
Which action should you perform when you create the shortcut in ContosoSales to meet these requirements?
Create the shortcut with anonymous (public) access and rely on Fabric workspace roles for security.
Configure the shortcut to use Azure Active Directory passthrough authentication.
Copy the files into a new managed table in the lakehouse and grant Fabric item-level permissions instead of creating a shortcut.
Configure the shortcut to use a service principal that has Storage Blob Data Reader rights on the container.
Answer Description
When you create a shortcut from a Fabric lakehouse to an external ADLS Gen2 location, you can choose Azure Active Directory (Azure AD) passthrough as the authentication method. With Azure AD passthrough, OneLake uses each caller's own Azure AD identity to access the underlying storage. Therefore any user who has the necessary Azure RBAC role assignments (for example, Storage Blob Data Reader) on the ADLS Gen2 container will automatically be able to read the data when they query through the lakehouse's SQL endpoint. Because access is enforced by the storage account, you do not need to provision or share a service principal, nor can you satisfy the requirement by selecting anonymous access or by copying the data into OneLake. Azure AD passthrough is the only option that keeps authorization in ADLS Gen2 while avoiding data duplication, so enabling Azure AD passthrough during shortcut creation is the correct choice.
Ask Bash
What is Azure Active Directory passthrough authentication?
What is the role of Azure role-based access control (RBAC) in this setup?
Why is copying the data into OneLake not a suitable solution in this case?
You are designing a new fact table named SalesTransactions in a Microsoft Fabric data warehouse. The table will ingest about 500 million rows weekly. Business analysts frequently join SalesTransactions to the Customer dimension table on the CustomerID column and then aggregate sales per customer. To reduce data movement during these joins and achieve the best query performance, which distribution option should you choose when creating the SalesTransactions table?
Define the table as ROUND_ROBIN distributed across compute nodes.
Define the table as HASH distributed on the CustomerID column.
Define the table as HASH distributed on the transaction Date column.
Define the table as REPLICATE distributed to every compute node.
Answer Description
Hash distributing SalesTransactions on CustomerID ensures that rows sharing the same CustomerID are stored in the same distribution as the dimension table that is also hashed on CustomerID. This co-location avoids shuffle operations during joins, providing the highest performance for large fact tables. Replicating a 500-million-row table would exceed practical limits for replicate distribution. Round-robin distributes rows without regard to join keys, so data movement would still occur. Hashing on Date would not align the distributions of the two tables, again forcing expensive data shuffles.
Ask Bash
Why is HASH distribution on CustomerID chosen for the SalesTransactions table?
What happens if ROUND_ROBIN distribution is used instead?
What are the practical limitations of REPLICATE distribution for large tables?
You manage a Microsoft Fabric workspace that contains a semantic model named Contoso Sales. The model is configured to refresh every hour. The data-quality team must receive an email whenever a scheduled refresh ends with a Failure status. You need a solution that can be configured directly in the Fabric (Power BI) service without building a custom monitoring pipeline. What should you do?
Open the semantic model's Scheduled refresh settings, enable refresh failure notifications, and add the data-quality distribution list as email recipients.
Pin the semantic model's refresh history to a dashboard and configure a data alert on the visual so that it emails the data-quality team when the value equals Failure.
Create an Azure Monitor metric alert that fires when the semantic model's refresh duration metric exceeds a threshold and add the distribution list to the action group.
Configure a Fabric capacity usage alert for the workspace and specify the distribution list as notification recipients.
Answer Description
In the Fabric (Power BI) service, each semantic model (dataset) has a Scheduled refresh settings pane. In that pane you can turn on refresh failure notifications and specify one or more email addresses in the Email recipients (also surfaced as Email these contacts or Failure notification recipients) box. When a scheduled refresh fails, the service automatically sends an alert email to every address listed. Creating Azure Monitor alerts or capacity-level alerts does not target the semantic model's refresh operation, and dashboard data alerts can only monitor numeric values rendered in a visual; refresh status is not exposed that way. Therefore, enabling the built-in refresh failure notification and adding the distribution list is the correct and simplest solution.
Ask Bash
What is the Fabric semantic model?
How do you configure scheduled refresh and failure notifications in Fabric?
Why are Azure Monitor or dashboard visual alerts inappropriate for refresh failure notifications?
You are designing a daily process in Microsoft Fabric to retrieve foreign-exchange rates. The process must read a list of currency codes from a table, call an external REST API once for each code, perform complex JSON parsing that relies on a custom Python library, and then write the results as Delta tables in a Lakehouse. Developers want to run and debug the logic interactively before scheduling it in production. Which Fabric item should you choose to implement the core data-transformation logic?
Develop a Dataflow Gen2 that retrieves the API data and performs the parsing and write-back operations.
Create a Spark notebook and schedule it (or call it from a pipeline) to handle the loop, API calls, parsing, and writes.
Use a Data Factory pipeline that executes a stored procedure in a Fabric warehouse to perform the API calls and transformations.
Build a Data Factory pipeline that uses a Copy activity combined with a Mapping Data Flow to ingest and transform the API responses.
Answer Description
A Spark notebook is best suited when you need flexible, code-driven data processing in Fabric. Notebooks let you write and test Python (or Scala, SQL, R) interactively, import third-party libraries, loop over input data, and invoke REST APIs. After development, the notebook can be scheduled or invoked from a pipeline for orchestration. Data Factory pipelines, Copy activities, and Dataflow Gen2 are low-/no-code tools optimized for declarative data movement and transformations; they cannot easily perform row-by-row API calls or use arbitrary Python packages. Stored procedures in a warehouse also lack the ability to call external web services or use Python libraries. Therefore, the notebook option meets all requirements.
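
To make the notebook option concrete, here is a hedged sketch of the kind of logic a notebook can host. The API endpoint, table names, and JSON response shape are illustrative assumptions rather than a real service:

# Sketch: read currency codes from a lakehouse table, call a hypothetical REST
# API once per code, parse the JSON, and append the results to a Delta table.
import requests
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()

# Assumed input table holding one currency code per row.
codes = [r["CurrencyCode"] for r in spark.read.table("CurrencyCodes").collect()]

rows = []
for code in codes:
    # Hypothetical endpoint and response shape; real parsing could use a custom Python library.
    resp = requests.get(f"https://api.example.com/fx/{code}", timeout=30)
    resp.raise_for_status()
    payload = resp.json()
    rows.append(Row(currency=code, rate=float(payload["rate"]), as_of=payload["date"]))

# Land the parsed results as a Delta table in the lakehouse.
spark.createDataFrame(rows).write.format("delta").mode("append").saveAsTable("FxRates")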
Ask Bash
What is a Spark notebook in Microsoft Fabric?
Why can’t Data Factory pipelines handle the described task effectively?
What is the difference between using a Delta table and a regular table in a Lakehouse?
You created a Microsoft Fabric workspace in live (default) mode to develop a new lakehouse and several Data Factory pipelines. Your team now requires full source control so that each change can be committed to a feature branch and reviewed before it reaches the main branch. You must also keep the existing workspace items and continue development without losing any work. What should you do first?
Create a new Fabric workspace in Git mode, export the current artifacts as .pbip files, and import them into the new workspace.
Enable change tracking at the lakehouse level and configure the Data Factory pipelines to use published versions only.
Configure a deployment pipeline for the current workspace and add a new Development stage linked to the feature branch.
Switch the existing workspace from Live mode to Git mode and connect it to the team's Azure DevOps repository and development branch.
Answer Description
Live-connected Fabric workspaces are not under source control. To put every artifact (lakehouse, pipeline, dataflow, notebook, etc.) under version control while preserving current work, you must convert the workspace from Live mode to Git mode and connect it to a repository and branch. This enables commit, pull-request, and merge workflows for every subsequent change. Creating a new workspace, exporting items, or building a deployment pipeline would not meet the requirement because the source items would remain outside Git until the workspace itself is placed under Git integration; deployment pipelines manage promotion, not version history.
Ask Bash
What is the difference between Live mode and Git mode in Microsoft Fabric workspaces?
How do you switch a Microsoft Fabric workspace from Live mode to Git mode?
Why is source control important in Microsoft Fabric workspaces?
You are building a data solution in Microsoft Fabric. Whenever a new CSV file arrives in a specific OneLake folder, you must automatically run a short PySpark notebook that cleans the data and appends it to a Lakehouse table. If the notebook fails, a notification must be posted to a Microsoft Teams channel. You want a low-code approach that provides built-in monitoring and retry capabilities. Which Fabric component should you implement?
Define a Delta Live Tables pipeline inside a notebook and schedule it with the Fabric SQL job scheduler, adding Teams notification logic in SQL.
Schedule the PySpark notebook to run every few minutes and add Python code that polls the folder and calls the Microsoft Teams API to post a message on errors.
Create a Data Factory pipeline that uses an event trigger on the OneLake folder, executes the PySpark notebook, and adds a conditional step to send the Teams notification on failure.
Configure a stored procedure in the Lakehouse SQL endpoint and invoke it through a OneLake event grid trigger to process the file and send Teams alerts.
Answer Description
Data Factory pipelines in Microsoft Fabric are designed for low-code orchestration. An event-based trigger can watch a OneLake folder and start the pipeline when a file is created. The pipeline can contain a Notebook activity to execute the PySpark notebook and a subsequent Teams (Web activity or Logic App) step that runs only if the notebook fails, providing conditional branching, retries, and monitoring in the Fabric portal. Running the whole process inside a standalone notebook would require custom polling and error-handling code, while Delta Live Tables and SQL job objects are not available or appropriate in Microsoft Fabric. Therefore, a Data Factory pipeline with an event trigger is the most suitable choice.
Ask Bash
What is Microsoft Fabric's Data Factory and how does it enable low-code orchestration?
How does an event-based trigger in Data Factory work?
What are conditional steps in a Data Factory pipeline, and how can they handle failures?
You manage a Microsoft Fabric workspace that receives daily CSV files into a lakehouse bronze folder. The business wants a reusable transformation that non-developer analysts can build in the Power BI service without writing code. The solution must join the files, filter rows, create aggregates, and write the results to a dimension table in the same lakehouse. Which Fabric component should you use to build the transformation?
Implement a T-SQL stored procedure in a Fabric warehouse and call it from a pipeline.
Create a Dataflow Gen2 that loads its output to the lakehouse table.
Write a KQL query in a Real-Time Analytics database and materialize the results.
Develop a Spark notebook with PySpark transformations and schedule it in a pipeline.
Answer Description
Dataflow Gen2 uses the web-based Power Query editor, enabling analysts to create transformations without writing code and to output the results directly to a lakehouse table. Spark notebooks require PySpark skills, KQL targets Real-Time Analytics rather than scheduled batch processing into lakehouses, and T-SQL stored procedures require writing SQL code.
Ask Bash
What is a Microsoft Fabric Dataflow Gen2?
How does Dataflow Gen2 enable analysts to work without writing code?
What are the benefits of using a lakehouse with Dataflow Gen2?
You are troubleshooting an hourly Dataflow Gen2 that transforms JSON files from a landing-zone folder and loads the results into a curated table. The last two refreshes failed. You need to determine the exact applied step that produced the error and review how many rows were processed immediately before the failure. Which action should you take in Microsoft Fabric?
Open the dataflow's Refresh history page and inspect the run-details pane for the failed refresh.
Review the pipeline activity run details in the Monitor hub.
Query the lakehouse's _sys.diag_execution_log table for the dataflow run to retrieve per-step metrics.
Enable query diagnostics in the Power Query editor and review the generated diagnostics tables.
Answer Description
The _sys.diag_execution_log system table records detailed execution information for every Dataflow Gen2 refresh, including each Power Query applied step, its duration, input and output row counts, and any error messages. By filtering this table for the failed run ID, you can pinpoint the step where the refresh stopped and inspect the row statistics right before the failure.
The Refresh history/run-details pane surfaces high-level error information but does not include per-step row counts. Query diagnostics only captures design-time traces inside the Power Query editor. Pipeline activity run details are available only when the dataflow is invoked from a pipeline and also lack internal transformation-step metrics.
Ask Bash
What is the _sys.diag_execution_log table in Microsoft Fabric?
How can you filter the _sys.diag_execution_log table for a failed dataflow run?
Why does the Refresh history page not provide enough detail to debug a failed dataflow?
In a Microsoft Fabric notebook, you develop a PySpark Structured Streaming query that reads JSON events from Azure Event Hubs and writes five-minute order totals to a Delta table located in the Lakehouse's bronze zone. The solution must resume automatically with no data loss if the notebook runtime is restarted. When you configure the writeStream operation, which option must you set to meet the requirement?
Set option("mergeSchema", "true") on the writeStream sink.
Set option("checkpointLocation", "abfss://lakehouse/checkpoints/orders") on the writeStream sink.
Set option("ignoreChanges", "true") on the writeStream sink.
Set option("maxFilesPerTrigger", "1") on the writeStream sink.
Answer Description
Spark Structured Streaming keeps track of processed offsets and any intermediate aggregation state in a persistent checkpoint directory. By adding the checkpointLocation option to the writeStream definition, the query writes its metadata and state to reliable storage so that, after any failure or restart, Spark can recover exactly where it left off and continue processing without data loss or duplication. The other options serve different purposes: mergeSchema controls schema evolution when writing to Delta, ignoreChanges is used for Change Data Feed reads, and maxFilesPerTrigger simply limits the number of source files per micro-batch. None of those settings enable recovery after a failure.
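
A hedged sketch of how the checkpoint fits into this kind of windowed query follows; a rate source stands in for the parsed Event Hubs stream, and the column, path, and table names are assumptions:

# Sketch: five-minute order totals written to a bronze Delta table, with the
# checkpoint providing offset and state recovery after a restart.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Stand-in for the parsed Event Hubs stream, renamed to an assumed schema.
orders = (
    spark.readStream.format("rate").option("rowsPerSecond", 5).load()
    .select(F.col("timestamp").alias("order_ts"),
            F.col("value").cast("double").alias("amount"))
)

totals = (
    orders
    .withWatermark("order_ts", "10 minutes")                    # bound the state kept for late events
    .groupBy(F.window("order_ts", "5 minutes"))
    .agg(F.sum("amount").alias("order_total"))
)

query = (
    totals.writeStream
    .format("delta")
    .outputMode("append")                                       # emit each window once it is finalized
    .option("checkpointLocation", "Files/checkpoints/orders")   # the option that enables restart recovery
    .toTable("bronze_order_totals")                             # illustrative bronze-zone table name
)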
Ask Bash
What is checkpointLocation in Spark Structured Streaming?
How does Spark Structured Streaming ensure recovery after a restart or failure?
Why doesn't mergeSchema, ignoreChanges, or maxFilesPerTrigger enable recovery?
You are a Microsoft Fabric administrator for Contoso. Your organization has enabled data domains and created several domain workspaces. A data engineering team discovers that they cannot assign a non-member security group to the 'Data Curators' role in one of the domain workspaces. You need to explain why this happens and how to allow the team to delegate curation tasks to that security group. Which action should you recommend?
Change the workspace from a domain workspace back to a personal workspace.
Move the workspace to a different Fabric capacity in the same tenant.
Add the security group as a member of the domain workspace and then assign it to the Data Curators role.
Enable Git-based version control for the workspace.
Answer Description
In a domain workspace, only users or groups that are already members of the workspace can be assigned to privileged domain roles such as Data Curators. To delegate curation to an external security group, you must first add that group as a workspace member. After the group is added, you can promote it to the Data Curators role. Moving the workspace to a different capacity, enabling Git integration, or converting the workspace to a different type does not affect role assignment eligibility.
Ask Bash
What is a domain workspace in Microsoft Fabric?
What does the 'Data Curators' role in Microsoft Fabric do?
Why can't non-members be assigned to privileged roles like 'Data Curators' in domain workspaces?
You are designing a Microsoft Fabric solution to ingest a nightly 200-GB batch of structured sales data that lands in Azure Data Lake Storage Gen2. Analysts query the data in Power BI and need fully ACID transactions, time-travel auditing, and automatic index and statistics maintenance for complex T-SQL joins. Which Fabric data store should you load the sales data into to best satisfy these requirements?
Azure Cosmos DB for NoSQL
Fabric Real-Time Intelligence KQL database
Fabric Lakehouse
Fabric Warehouse
Answer Description
A Microsoft Fabric Warehouse is a managed, SQL-based analytical store built on OneLake. Warehouses provide full ACID transaction support, time-travel (via delta logs), and an engine that automatically maintains indexes and statistics so complex T-SQL queries from Power BI run efficiently. A Lakehouse also offers ACID and time-travel through Delta Lake but does not manage relational indexes or statistics, making it less suitable when many joined queries are expected. A KQL database is optimized for log and time-series data rather than relational joins, and Azure Cosmos DB targets operational NoSQL scenarios, not large analytical fact tables queried with T-SQL.
Ask Bash
What is a Microsoft Fabric Warehouse, and why is it suitable for complex T-SQL queries?
How is the Fabric Lakehouse different from the Fabric Warehouse in terms of functionality?
What is the purpose of time-travel auditing in data stores like Microsoft Fabric Warehouse?
You manage a Microsoft Fabric workspace that contains a certified dataset named Sales Model. An external analyst must be able to connect to the dataset from Excel and Power BI Desktop and publish their own reports to a different workspace. The analyst must not be able to edit the dataset itself or see any other items in your workspace. Which item-level permission should you assign to the analyst on Sales Model?
Build (Read, reshare, and build)
Admin
Read
Contribute
Answer Description
Assigning the Build permission (displayed in the UI as Read, reshare, and build) on a single dataset lets a user connect from external tools, create new reports, and publish those reports to other workspaces. Build automatically includes Read but does not grant dataset edit rights, workspace-level privileges, or access to other items. Read alone would block report creation, while Contribute or Admin would allow modifying the dataset or managing the workspace.
Ask Bash
What is the Build permission in Microsoft Fabric?
How does Build differ from Read permission?
Can a user with Build permission modify a dataset?
You manage a Fabric Eventstream that ingests 100,000 telemetry events per second from IoT devices into an Eventhouse table named Telemetry. Analysts complain that KQL queries filtered to the most recent hour take several seconds to return. Investigation shows that ingestion generates thousands of small extents (<50 MB). Without provisioning additional capacity, which configuration change is most likely to improve both ingestion efficiency and query performance?
Increase the target extent size by editing the ingestion batching policy for the Telemetry table.
Create a materialized view that summarizes the Telemetry table every five minutes.
Enable delta ingestion mode on the Eventstream output to the Telemetry table.
Reduce the table data retention period from 30 days to 7 days.
Answer Description
Eventhouses are backed by the Azure Data Explorer engine, which stores data in compressed columnar extents. When ingestion creates many small extents, each query must touch a large number of files, increasing both scan time and CPU overhead. By increasing the target extent (batch) size through the table's ingestion batching policy, the service accumulates more data before committing an extent, resulting in fewer, larger extents. This reduces the number of extents that queries must open and also improves compression efficiency, giving better overall performance without additional compute. Shortening retention only affects how long data is kept, not how it is laid out. Materialized views help for pre-aggregated workloads but do not accelerate raw-data queries that filter on recent time ranges. "Delta ingestion mode" is not a supported optimization in Fabric Eventstreams/Eventhouses.
Ask Bash
What is an extent in Azure Data Explorer?
What is ingestion batching policy in Eventhouses?
How does columnar compression improve query performance?
You work in a Microsoft Fabric lakehouse. The Sales table has about 500 million rows, and the ProductSubcategory and ProductCategory tables each have fewer than 1,000 rows. You must build a daily Gold-layer table that denormalizes Sales with subcategory and category attributes while minimizing network shuffle and keeping the join in memory. Which Spark technique should you apply before running the joins?
Repartition the Sales DataFrame to a single partition, then perform the joins sequentially.
Disable Adaptive Query Execution so that Spark resorts to default shuffle hash joins.
Combine the three DataFrames with unionByName() and apply filters afterward.
Use the Spark broadcast() function (or BROADCAST join hint) on the two small lookup DataFrames before joining them to Sales.
Answer Description
Broadcasting very small lookup tables is a well-known Spark optimization. When you call broadcast() (or use the BROADCAST join hint) on ProductSubcategory and ProductCategory, the driver ships their data to every executor node. Each executor can then join its partition of the large Sales DataFrame locally, eliminating shuffle of the 500-million-row fact table. Repartitioning Sales to one partition forces single-threaded work, disabling AQE does not reduce shuffle, and unionByName() appends rows rather than joins.
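
A hedged sketch of this broadcast pattern follows, using illustrative lakehouse table and key-column names:

# Sketch: broadcast the two small lookup tables so each executor joins its
# partition of Sales locally, avoiding a shuffle of the 500-million-row fact table.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

sales = spark.read.table("Sales")
subcategory = spark.read.table("ProductSubcategory")
category = spark.read.table("ProductCategory")

# Join keys are assumed column names for illustration.
gold = (
    sales
    .join(broadcast(subcategory), "ProductSubcategoryKey")   # small dim shipped to every executor
    .join(broadcast(category), "ProductCategoryKey")          # same for the category lookup
)

gold.write.format("delta").mode("overwrite").saveAsTable("SalesDenormalized")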
Ask Bash
What is a Spark broadcast join?
What is Adaptive Query Execution (AQE) in Spark?
How does network shuffle affect performance in Spark?
You are designing a streaming ingestion solution in Microsoft Fabric Real-Time Intelligence. IoT devices send 5,000 events per second through Azure Event Hubs. Analysts must run KQL queries with sub-second latency over the most recent 30 days of data. The events will not be stored in any other system. Which storage option should you configure as the Eventstream destination to meet these requirements?
A shortcut that references Delta Lake files in an existing Fabric Lakehouse
Mirrored tables that replicate data from Azure SQL Database
Native tables (native storage) in Real-Time Intelligence
Mirrored tables that replicate data from Azure Cosmos DB
Answer Description
Native tables use Fabric's internal OneLake storage, are optimized for high-throughput streaming ingestion, and allow very low-latency KQL queries. Because the events are not already held in another operational store, mirroring would add unnecessary replication overhead. Shortcuts simply reference existing files and therefore cannot ingest high-volume streaming data that does not yet exist elsewhere. Selecting native storage therefore delivers the required ingestion performance and interactive analytics while supporting a short, 30-day retention period.
Ask Bash
What is Azure Event Hubs and its role in streaming ingestion?
What are native tables in Microsoft Fabric Real-Time Intelligence?
Why are mirrored tables or shortcuts not suitable for IoT streaming ingestion?
An IoT gateway continuously writes raw CSV files to the /ingest/raw/ folder of a Microsoft Fabric Lakehouse. A PySpark notebook must run immediately after each new file arrives to cleanse the data and append it to a Delta table, without rerunning for earlier files. Which trigger type should you configure on the pipeline that calls the notebook?
Configure an event trigger that fires on the creation of a new file in the /ingest/raw/ folder.
Configure a tumbling-window trigger with a one-hour interval.
Rely on a manual trigger that operators can start after verifying file arrival.
Configure a scheduled trigger that runs the pipeline every five minutes.
Answer Description
An event trigger listens for file-creation events in Azure Data Lake Storage Gen2, which backs a Lakehouse folder. It starts the pipeline once for each new file, ensuring the PySpark notebook runs immediately and only for that file.
A scheduled trigger runs even when no files arrive, risking duplicate or unnecessary processing. A tumbling-window trigger is time-window based and not intended for instant per-file execution. A manual trigger relies on human initiation and cannot guarantee immediate processing.
Ask Bash
What is an event trigger in Microsoft Fabric?
How is an event trigger different from a scheduled trigger in pipelines?
Why is a tumbling-window trigger unsuitable for instant file processing?
You need to receive an email whenever a specific pipeline in a Microsoft Fabric Data Factory workspace runs for longer than 15 minutes. You want the solution to rely only on built-in Fabric capabilities and require the least ongoing administration. Which approach should you use?
Configure a data alert on a Power BI dashboard tile that displays the pipeline's execution time.
Create a threshold alert in the Monitoring hub by using an Activator that triggers when the pipeline run duration exceeds 15 minutes.
Add a SQL trigger on the lakehouse execution log table to send an email when a run lasts more than 15 minutes.
Create an Azure Monitor metric alert rule that evaluates the PipelineRunDuration metric for the workspace.
Answer Description
The Monitoring hub in Microsoft Fabric includes an Activator that can create threshold-based alerts for Data Factory pipelines, including run-duration conditions. By configuring an Activator alert that triggers when the pipeline run time exceeds 15 minutes, the notification is handled entirely inside Fabric without additional code or external services. Azure Monitor metric alerts cannot target a Fabric Data Factory workspace or evaluate the PipelineRunDuration metric. Power BI dashboard data alerts do not track pipeline run duration, and adding a SQL trigger to a lakehouse table would require custom logging and maintenance.
Ask Bash
What is a threshold alert in Microsoft Fabric's Monitoring hub?
What is the role of Activators in the Monitoring hub?
Why are external tools like Azure Monitor not suitable for monitoring Fabric pipeline durations?
You ingested SalesFact, ProductDim (about 500,000 rows), and CategoryDim (30 rows) into a Microsoft Fabric lakehouse. Each product belongs to one category. To avoid extra joins in Power BI, you need to merge ProductDim and CategoryDim into a ProductExtended Delta table by using a PySpark notebook. Which approach minimizes shuffle and stays scalable as data volumes grow?
Repartition both tables to a single partition, perform a standard inner join, and then write the output.
Create a view that joins ProductDim and CategoryDim at query time and expose the view to Power BI.
Use a broadcast join in PySpark to join ProductDim with CategoryDim, then write the result to the ProductExtended Delta table.
Run a CROSS JOIN between ProductDim and CategoryDim and apply a filter on matching keys before writing the result.
Answer Description
A broadcast join distributes the very small CategoryDim table to every executor, allowing each partition of the larger ProductDim table to join locally without shuffling ProductDim across the cluster. This pattern scales well as ProductDim grows. Creating a view keeps the tables normalized, repartitioning to a single partition can throttle performance, and a cross join with a filter still causes a large shuffle.
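
For illustration, a minimal sketch of this approach follows; the join key (CategoryKey) is an assumed column name:

# Sketch: broadcast the 30-row CategoryDim so ProductDim joins locally without a
# shuffle, then persist the result as the ProductExtended Delta table.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

product_dim = spark.read.table("ProductDim")
category_dim = spark.read.table("CategoryDim")

product_extended = product_dim.join(broadcast(category_dim), "CategoryKey")
product_extended.write.format("delta").mode("overwrite").saveAsTable("ProductExtended")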
Ask Bash
What is a broadcast join in PySpark?
Why does a broadcast join scale better compared to other join types in a large dataset?
How do PySpark's repartition and join operations impact performance in large-scale data processing?