During an architecture review, your team must choose an Apache Spark Structured Streaming execution mode for ingesting an unbounded sensor-event stream. The product owner requires end-to-end exactly-once processing so that no sensor reading is ever double-counted after a failure, but is willing to accept latencies in the hundreds-of-milliseconds range. Which statement correctly describes how the two built-in execution modes address this requirement?
The default micro-batch mode can satisfy the requirement because it achieves exactly-once semantics through checkpointing, replayable sources, and idempotent sinks, whereas continuous processing trades that guarantee for latencies as low as about 1 ms and only provides at-least-once semantics.
Neither mode provides exactly-once semantics; the application must implement two-phase commits to achieve that regardless of execution mode.
Both modes provide exactly-once semantics, but continuous processing achieves it by buffering the entire stream in memory rather than using checkpoints.
Continuous processing is required because it is the only mode that guarantees exactly-once semantics; micro-batch mode may lose data after a restart.
Spark's default micro-batch engine records the source offset range of each micro-batch in checkpoints and write-ahead logs, and it relies on replayable sources and idempotent sinks, so the system can replay any micro-batch after a failure without creating duplicates. This design delivers end-to-end exactly-once semantics, with latencies as low as about 100 ms. Continuous processing removes the micro-batch barrier and pushes latency down to about 1 ms, but because records processed since the last committed checkpoint may be replayed after a fault, the guarantee is reduced to at-least-once. Therefore, selecting micro-batch mode satisfies the product owner's duplicate-prevention requirement and fits the hundreds-of-milliseconds latency budget, while continuous mode would violate the requirement. The other options are incorrect because they either reverse these guarantees, claim both modes offer exactly-once, or assert that Spark never offers exactly-once without custom two-phase commits.
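The replay argument above can be illustrated with a toy model in plain Python. This is only a sketch, not Spark code: `IdempotentSink`, `AppendSink`, and `run_with_one_failure` are hypothetical names invented for the illustration, and the "checkpoint" is just an integer offset. It assumes a crash that happens after the sink write but before the offset is committed, which forces the same batch to be replayed.

```python
from dataclasses import dataclass, field


@dataclass
class IdempotentSink:
    """Hypothetical sink keyed by batch id: rewriting a batch overwrites it."""
    committed: dict = field(default_factory=dict)

    def write(self, batch_id, rows):
        self.committed[batch_id] = list(rows)

    def total(self):
        return sum(sum(rows) for rows in self.committed.values())


@dataclass
class AppendSink:
    """Hypothetical non-idempotent sink: every write appends, even on replay."""
    rows: list = field(default_factory=list)

    def write(self, batch_id, rows):
        self.rows.extend(rows)

    def total(self):
        return sum(self.rows)


def run_with_one_failure(events, batch_size, sink, fail_batch_id):
    """Process events in micro-batches, crashing once after writing fail_batch_id."""
    checkpoint_offset = 0  # last committed source offset (the toy "checkpoint")
    batch_id = 0
    failed_once = False
    while checkpoint_offset < len(events):
        end = min(checkpoint_offset + batch_size, len(events))
        batch = events[checkpoint_offset:end]
        sink.write(batch_id, batch)
        if batch_id == fail_batch_id and not failed_once:
            # Simulated crash: the offset is not committed, so the loop
            # replays the same offset range under the same batch id.
            failed_once = True
            continue
        checkpoint_offset = end
        batch_id += 1
    return sink.total()


readings = list(range(1, 11))  # ten sensor readings whose true sum is 55
exactly_once_total = run_with_one_failure(readings, 3, IdempotentSink(), 1)
at_least_once_total = run_with_one_failure(readings, 3, AppendSink(), 1)
print(exactly_once_total, at_least_once_total)
```

With the idempotent sink the replayed batch overwrites itself and the total stays at 55; with the append-only sink the replayed readings 4, 5, and 6 are counted twice and the total becomes 70. In PySpark itself, the execution mode is chosen on the stream writer, for example `.trigger(processingTime="200 milliseconds")` for micro-batch or `.trigger(continuous="1 second")` for continuous processing.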