A financial institution is deploying a critical credit risk model that is retrained monthly. Regulatory compliance mandates that any prediction made by the model in production must be fully reproducible, including the ability to trace the model's lineage and training environment. As the lead data scientist, you are designing the metadata logging strategy for the model registry. Which of the following metadata components is most crucial for ensuring long-term auditability and the ability to precisely replicate a specific production model version months or years after its deployment?
A detailed data dictionary for the input features, the business requirements document version, and the contact information for the product owner.
The versioned training dataset identifier, the source code's version control hash (e.g., Git commit ID), a complete record of hyperparameters, and a manifest of key software library versions (e.g., requirements.txt).
The model's final performance metrics on the hold-out test set, the target variable's statistical distribution, and the names of the engineers who approved the deployment.
The URI of the serialized model object, the total number of features used, and the compute resources (CPU/GPU type) utilized for training.
The correct answer includes the four key components required for full model reproducibility and auditability. To precisely recreate a model training process, one must have the exact data that was used, the specific code that was run, the hyperparameters that guided the training algorithm, and the software environment in which the training was executed.
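As an illustration, these four components could be captured at training time and logged alongside the model. The sketch below is hypothetical (the helper name, the dataset-identifier scheme, and the registry format are assumptions, not a specific registry's API), but the fields mirror the metadata listed above:

```python
import hashlib
import json
import subprocess
import sys

def capture_reproducibility_metadata(dataset_id, hyperparameters,
                                     requirements_path="requirements.txt"):
    """Collect the four artifacts needed to recreate a training run."""
    # 1. Versioned training dataset identifier (assumed to come from your data store)
    metadata = {"dataset_id": dataset_id}

    # 2. Source code version: the current Git commit hash
    try:
        metadata["git_commit"] = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError, OSError):
        metadata["git_commit"] = None  # not inside a Git repository

    # 3. Complete record of hyperparameters
    metadata["hyperparameters"] = hyperparameters

    # 4. Software environment manifest (e.g., a pinned requirements.txt)
    try:
        with open(requirements_path) as f:
            manifest = f.read()
        metadata["library_manifest"] = manifest
        # Fingerprint the manifest so it can be integrity-checked at audit time
        metadata["manifest_sha256"] = hashlib.sha256(manifest.encode()).hexdigest()
    except FileNotFoundError:
        metadata["library_manifest"] = None

    metadata["python_version"] = sys.version
    return metadata

# Example: serialize the record for storage in the model registry
record = capture_reproducibility_metadata(
    dataset_id="credit_risk_train_2024-06",
    hyperparameters={"max_depth": 6, "learning_rate": 0.05},
)
print(json.dumps(record, indent=2)[:120])
```

In practice a registry such as MLflow or SageMaker Model Registry would store these fields as run parameters and tags; the point is that all four must be recorded atomically with the model version, not reconstructed after the fact.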
The option concerning performance metrics and approvers is incorrect because this metadata relates to model evaluation and governance, not the technical replication of the model itself.

The option listing the data dictionary and business requirements is incorrect because it describes project-level and contextual documentation, which, while important, does not provide the technical artifacts needed to retrain the model.

The option with the model URI and compute resources is incorrect because it describes the output of the training process (the serialized model) and the training environment's hardware. While the model URI is necessary to load the existing artifact, it is insufficient for recreating the model from scratch, which is a key requirement for a full audit.