An e-commerce company is running an online A/B experiment to compare a new fraud-detection model (Model B) with the current production model (Model A). Traffic is evenly split (50/50), and a power analysis shows that at least 120,000 transactions per model are required to detect the minimum effect size at α = 0.05 with 80% power. After 24 hours, provisional dashboards suggest Model B has fewer false positives, and operations management asks the MLOps team to start increasing the share of requests routed to Model B each day so that potential benefits are realized sooner.
According to best practices for A/B testing within an MLOps pipeline, which response will best preserve the statistical validity of the experiment while still aligning with DevOps principles of controlled, incremental release?
Increase Model B's traffic share whenever its cumulative false-positive rate beats Model A, even if the sample-size target has not been met.
Maintain the 50/50 split until each model has processed the pre-calculated sample size, then decide whether to promote Model B.
Switch to a blue/green deployment that immediately sends 100% of traffic to Model B while keeping Model A on standby for rollback.
Enable online learning in Model B during the experiment so its weights update continuously, eliminating the need for formal hypothesis testing.
The experiment's hypothesis test assumes a fixed, pre-specified sample size. Ending or modifying the test before each arm reaches that sample size breaks those assumptions, inflates type-I error, and can bias results (the classic "peeking" problem). Maintaining the original 50/50 allocation until both models have processed the calculated number of transactions upholds statistical validity; only after the stopping rule is met should traffic be re-allocated or the new model promoted.
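To make the peeking problem concrete, the following minimal simulation sketch (Python with NumPy/SciPy; the false-positive rate, batch size, and number of looks are illustrative assumptions, and the simulated sample size is deliberately smaller than the scenario's 120,000 so it runs quickly) generates many A/B tests in which both models share the same true false-positive rate, then compares a single fixed-sample-size test against a strategy that re-tests the cumulative data after every batch and stops at the first "significant" result.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Both "models" share the same true false-positive rate: the null hypothesis
# is true, so every rejection is a type-I error. All values are illustrative.
P_NULL = 0.05      # assumed common false-positive rate
BATCH = 2_000      # transactions per arm between interim looks
LOOKS = 20         # number of interim checks (total 40,000 per arm)
ALPHA = 0.05
SIMS = 2_000

def two_prop_z(successes_a, n_a, successes_b, n_b):
    """Two-proportion z-test; returns the two-sided p-value."""
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (successes_a / n_a - successes_b / n_b) / se
    return 2 * stats.norm.sf(abs(z))

fixed_rejections = 0
peeking_rejections = 0

for _ in range(SIMS):
    a = rng.binomial(BATCH, P_NULL, size=LOOKS)  # false positives per batch, arm A
    b = rng.binomial(BATCH, P_NULL, size=LOOKS)  # false positives per batch, arm B

    # Fixed-sample test: look only once, at the pre-planned end of the test.
    if two_prop_z(a.sum(), BATCH * LOOKS, b.sum(), BATCH * LOOKS) < ALPHA:
        fixed_rejections += 1

    # Peeking: test the cumulative counts after every batch, stop at the first "win".
    ca, cb, n = 0, 0, 0
    for fa, fb in zip(a, b):
        ca, cb, n = ca + fa, cb + fb, n + BATCH
        if two_prop_z(ca, n, cb, n) < ALPHA:
            peeking_rejections += 1
            break

print(f"Type-I error, single fixed-sample look: {fixed_rejections / SIMS:.3f}")
print(f"Type-I error, peeking after every batch: {peeking_rejections / SIMS:.3f}")
```

Because the null hypothesis is true in every simulated experiment, any rejection is a false alarm: the fixed-sample test typically rejects close to the nominal 5%, while the peeking strategy rejects several times more often, which is exactly the type-I inflation described above.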
A blue/green cut-over routes 100% of traffic to the candidate model without any statistical comparison, so it is a deployment strategy, not a valid A/B test. Gradually shifting traffic based on interim metrics is another form of peeking that biases the test. Allowing Model B to learn online during the experiment alters the treatment mid-test, violating the requirement that variants remain fixed throughout the analysis window.
Therefore, keeping the original split until the predetermined sample size is reached is the only option that both respects DevOps discipline (controlled, reversible release) and preserves rigorous model validation.
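For context on where a pre-calculated figure like 120,000 transactions per model comes from, the sketch below uses statsmodels' power utilities to solve for the per-arm sample size of a two-proportion test. The baseline and target false-positive rates are illustrative assumptions; the question does not state them, and different inputs yield very different required sample sizes.

```python
# Illustrative power analysis for a two-proportion A/B test.
# The false-positive rates below are assumptions chosen for illustration;
# they are not given in the scenario.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_fpr = 0.0200   # assumed false-positive rate of Model A
target_fpr = 0.0189     # smallest improvement worth detecting for Model B

# Cohen's h effect size for comparing two proportions.
effect_size = proportion_effectsize(baseline_fpr, target_fpr)

# Solve for the number of observations per arm, using the alpha and power
# from the scenario and a 1:1 (50/50) traffic split.
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,
    power=0.80,
    ratio=1.0,
    alternative="two-sided",
)
print(f"Required transactions per model: {n_per_arm:,.0f}")
```

With these particular assumed rates the result lands in the same ballpark as the scenario's 120,000 per arm; the essential point is that this number is fixed before the experiment starts and serves as the stopping rule, after which traffic re-allocation or promotion can proceed.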