You have inherited a helper function named standardize_df() in a production feature-engineering library. The function must (1) subtract the mean and divide by the population standard deviation for every numeric column and (2) leave any column whose standard deviation is exactly zero unchanged to avoid division-by-zero problems. You are charged with adding a single PyTest unit test that delivers the strongest regression-catching power while still following unit-testing best practices (deterministic data, small scope, Arrange-Act-Assert structure, no unnecessary external libraries). Which test design best satisfies these requirements?
Generate 10 000 random rows, call standardize_df, and assert only that the output DataFrame has the same shape as the input.
Set numpy.random.seed(0) inside the test and simply check that standardize_df executes without raising an exception.
Apply scikit-learn's StandardScaler to a different DataFrame and assert that its output equals the output of standardize_df.
Construct a small DataFrame with one constant and one varying numeric column, run standardize_df, then assert with pytest.approx that the varying column now has mean 0 and std 1 and that the constant column is identical to the original.
The most effective test is the one that deliberately constructs a minimal, fully deterministic DataFrame that exercises both documented behaviors and then makes precise assertions. A two-column frame-one varying column and one constant (zero-variance) column-hits the boundary case that would otherwise raise division-by-zero errors. Using pytest.approx (or an equivalent tolerance-aware comparison) to assert the varying column's mean≈0 and std≈1 confirms correct standardization, while Series.equals on the constant column verifies that it was left untouched. This follows the Arrange-Act-Assert pattern, uses no external services, and will fail loudly if any future change alters the algorithm.
The other choices either observe only superficial properties (shape), rely on an additional library (StandardScaler) so that failures could originate outside the function under test, or make no content-specific assertion at all; hence they provide far less defect detection value.
Ask Bash
Bash is our AI bot, trained to help you pass your exam. AI Generated Content may display inaccurate information, always double-check anything important.
What is the purpose of `pytest.approx` in the unit test?
Open an interactive chat with Bash
Why is having a constant column (zero-variance) important in the test?
Open an interactive chat with Bash
Why are deterministic data and the Arrange-Act-Assert structure crucial for unit tests?