During an architecture review, your team must choose an Apache Spark Structured Streaming execution mode for ingesting an unbounded sensor-event stream. The product owner requires end-to-end exactly-once processing so that no sensor reading is ever double-counted after a failure, but is willing to accept latencies in the hundreds-of-milliseconds range. Which statement correctly describes how the two built-in execution modes address this requirement?
Continuous processing is required because it is the only mode that guarantees exactly-once semantics; micro-batch mode may lose data after a restart.
Both modes provide exactly-once semantics, but continuous processing achieves it by buffering the entire stream in memory rather than using checkpoints.
Neither mode provides exactly-once semantics; the application must implement two-phase commits to achieve that regardless of execution mode.
The default micro-batch mode can satisfy the requirement because it achieves exactly-once semantics through checkpointing and idempotent sinks, whereas continuous processing trades that guarantee for millisecond-level latency and provides only at-least-once semantics.
Spark's default micro-batch engine records source offsets in checkpoints and, given a replayable source and an idempotent sink, can reprocess any micro-batch after a failure without creating duplicates. This design delivers end-to-end exactly-once semantics, with latencies as low as roughly 100 ms, which fits the product owner's latency budget. Continuous processing removes the micro-batch boundary and pushes latency down to about 1 ms, but because records processed since the last committed checkpoint may be replayed after a fault, the guarantee is reduced to at-least-once. Therefore, selecting micro-batch mode satisfies the product owner's duplicate-prevention requirement, while continuous mode would violate it. The other options are incorrect because they reverse these guarantees, claim that both modes offer exactly-once, or assert that Spark never offers exactly-once without custom two-phase commits.
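The replay argument above can be illustrated with a small, Spark-free simulation. This is only a sketch under stated assumptions: the names `IdempotentSink`, `AppendSink`, and `run_with_crash` are hypothetical, the checkpoint is a plain dictionary, and the model is far simpler than Spark's actual checkpoint and commit logs. It shows why replaying an uncommitted micro-batch is harmless when the sink is keyed by batch id, and why the same replay double-counts readings when the sink merely appends.

```python
class IdempotentSink:
    """Stores one result per batch id, so rewriting a batch overwrites it."""

    def __init__(self):
        self.batches = {}

    def write(self, batch_id, rows):
        self.batches[batch_id] = list(rows)

    def total(self):
        return sum(sum(rows) for rows in self.batches.values())


class AppendSink:
    """Appends blindly, so a replayed batch is counted twice."""

    def __init__(self):
        self.rows = []

    def write(self, batch_id, rows):
        self.rows.extend(rows)

    def total(self):
        return sum(self.rows)


def run_with_crash(source, sink, batch_size, crash_batch):
    """Process `source` in micro-batches, crashing once after writing
    `crash_batch` but before its offset is committed to the checkpoint."""
    checkpoint = {"offset": 0, "batch_id": 0}  # hypothetical durable state

    def attempt(crash_at):
        while checkpoint["offset"] < len(source):
            start = checkpoint["offset"]
            batch_id = checkpoint["batch_id"]
            rows = source[start:start + batch_size]
            sink.write(batch_id, rows)
            if batch_id == crash_at:
                raise RuntimeError("simulated failure before commit")
            # Commit: advance the offset only after the write succeeded.
            checkpoint["offset"] = start + len(rows)
            checkpoint["batch_id"] = batch_id + 1

    try:
        attempt(crash_batch)
    except RuntimeError:
        attempt(None)  # restart resumes from the last committed offset
    return sink.total()


readings = [1, 2, 3, 4, 5, 6]
exactly_once_total = run_with_crash(readings, IdempotentSink(), 2, crash_batch=1)
duplicated_total = run_with_crash(readings, AppendSink(), 2, crash_batch=1)
```

With the idempotent sink the total equals the true sum of the readings; with the append-only sink the replayed batch `[3, 4]` is counted twice, which is the at-least-once behavior the product owner wants to avoid.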