A global financial services company is subject to both GDPR and CCPA regulations. The MLOps team is designing a new credit risk model and must incorporate a compliant strategy for handling 'right to erasure' (or 'right to be forgotten') requests from customers. The solution must ensure that a user's data can be completely removed from influencing the model's predictions in a timely and auditable manner, without incurring the cost of a full model retrain for every request. Which of the following approaches is the most robust and technically sound for meeting this requirement?
Utilize differential privacy during initial training by adding calibrated noise to the dataset, thereby mathematically ensuring the model's outputs are not attributable to any single individual.
Create a suppression list of user IDs. During inference, check the list and, if the user is present, return a generic, non-personalized output instead of the model's prediction.
Upon an erasure request, apply pseudonymization techniques like hashing to the user's records in the training data lake and then wait for the next scheduled quarterly model refresh to propagate the changes.
Implement an ensemble or sharded training architecture, such as SISA (Sharded, Isolated, Sliced, and Aggregated), where the data is partitioned. Upon an erasure request, only the model components or shards trained on the data containing the user's information are retrained.
The correct answer is to implement a sharded training architecture. The 'right to be forgotten' under regulations like GDPR requires that a data subject's personal data be erased, including its influence within a trained machine learning model. Retraining the entire model from scratch for every request is computationally and financially prohibitive. The SISA (Sharded, Isolated, Sliced, and Aggregated) framework is a 'machine unlearning' technique designed for exactly this problem: an ensemble of constituent models is trained, each in isolation on its own partition (shard) of the data. On an erasure request, only the constituent model whose shard contained the user's data is retrained (and slicing allows that retraining to resume from an intermediate checkpoint), removing the data's influence at a small fraction of the cost of a full retrain.
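As a rough sketch of that flow (assuming scikit-learn-style estimators; the shard count, column names, and helper names are illustrative, and the slicing/checkpointing step is omitted for brevity):

```python
# Minimal SISA-style sketch: train one model per disjoint data shard,
# aggregate their predictions, and retrain only the affected shard on erasure.
import numpy as np
from sklearn.linear_model import LogisticRegression

N_SHARDS = 8  # illustrative shard count

def assign_shard(user_id: int) -> int:
    # Deterministic assignment so an erasure request maps to exactly one shard.
    return user_id % N_SHARDS

def train_shard(shard_rows):
    X = np.array([r["features"] for r in shard_rows])
    y = np.array([r["label"] for r in shard_rows])
    return LogisticRegression(max_iter=1000).fit(X, y)

def train_all(dataset):
    shards = {i: [] for i in range(N_SHARDS)}
    for row in dataset:
        shards[assign_shard(row["user_id"])].append(row)
    models = {i: train_shard(rows) for i, rows in shards.items()}
    return shards, models

def predict(models, features):
    # Majority vote across the isolated shard models (the "Aggregated" step).
    votes = [m.predict([features])[0] for m in models.values()]
    return max(set(votes), key=votes.count)

def erase_user(user_id, shards, models):
    # Only the shard holding the user's rows is rebuilt and its model retrained;
    # every other constituent model is untouched.
    shard_id = assign_shard(user_id)
    shards[shard_id] = [r for r in shards[shard_id] if r["user_id"] != user_id]
    models[shard_id] = train_shard(shards[shard_id])
```

Because the retraining cost is bounded by one shard (or one slice of it), the erasure can be completed and logged well within the regulatory response window, which is what makes the approach auditable as well as efficient.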
The pseudonymization approach is incorrect for two reasons: First, pseudonymized data may still be considered personal data under GDPR if the individual can be re-identified, thus not satisfying the erasure requirement. Second, waiting for a scheduled quarterly refresh does not meet the 'without undue delay' (typically one month) stipulation for erasure requests.
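To illustrate the first point: hashing an identifier leaves the record's feature values untouched, and a retained salt or lookup table still allows re-identification, so nothing about the trained model changes. A minimal sketch, with illustrative column names:

```python
# Sketch: hashing an identifier pseudonymizes the row but does not erase it.
# The feature and label values survive unchanged, so any model trained on them
# is unaffected, and a retained salt/lookup table permits re-identification.
import hashlib

def pseudonymize(record: dict, salt: str) -> dict:
    hashed_id = hashlib.sha256((salt + record["user_id"]).encode()).hexdigest()
    return {**record, "user_id": hashed_id}

row = {"user_id": "cust-123", "income": 54000, "defaulted": 0}
print(pseudonymize(row, salt="s3cret"))
# Output still contains income/defaulted: the data's influence is not removed.
```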
The differential privacy approach is incorrect because it is a preventative technique designed to protect the privacy of individuals in the aggregate output of a model. It does not provide a mechanism to actively remove a specific individual's data from the model after it has been trained.
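For intuition, differential privacy injects calibrated noise at the moment a statistic or model update is computed, as in this simplified Laplace-mechanism sketch (not a full DP-SGD pipeline; the bounds and epsilon are illustrative). Once training is finished, there is no handle for subtracting one record:

```python
# Illustrative Laplace mechanism: noise scaled to sensitivity/epsilon is added
# when the statistic (or model update) is computed. This bounds what the output
# reveals about any one record, but offers no way to remove a record afterwards.
import numpy as np

def dp_mean(values, lower, upper, epsilon, rng=None):
    rng = rng or np.random.default_rng()
    clipped = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / len(clipped)  # sensitivity of a bounded mean
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise

incomes = np.array([42000, 55000, 61000, 48000])
print(dp_mean(incomes, lower=0, upper=200000, epsilon=1.0))
```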
Using a suppression list at inference time is incorrect because although it prevents the model from being used for that specific user, their data remains embedded in the trained model's parameters, influencing its overall behavior and predictions for other users. This fails to meet the core requirement of erasing the data.
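A short sketch makes the gap obvious: the check happens only at serving time, while the fitted parameters, which were estimated on the user's data, are untouched (function and variable names are illustrative):

```python
# Inference-time suppression: the model's learned parameters are unchanged,
# so the "erased" user's data still shaped every prediction the model makes.
SUPPRESSED_USER_IDS = {"cust-123"}
GENERIC_FALLBACK = {"decision": "manual_review"}  # illustrative non-personalized output

def score(model, user_id, features):
    # `model` is any fitted estimator; suppression only masks this user's output.
    if user_id in SUPPRESSED_USER_IDS:
        return GENERIC_FALLBACK
    return {"decision": model.predict([features])[0]}
```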