A large online-retail company must share a click-stream dataset containing more than 100 million rows with a university research partner. The partner needs to join all events that belong to the same customer across multiple browsing sessions to train sequence-based recommender models. At the same time, the company must guarantee that the real customer identifier cannot be reconstructed, even by an attacker with extensive auxiliary information, and it does not want to maintain any reversible token vault or mapping table. Analytical accuracy at the individual-event level must be preserved. Which anonymization approach BEST meets these business and compliance requirements?
Hash the customer identifier with a secret, randomly generated salt using a cryptographically secure one-way hash function and share only the resulting digest in the dataset.
Tokenize the customer identifier using reversible format-preserving encryption and store the mapping table in an internal secure vault.
Provide access through a query service that injects differential-privacy noise into every aggregate result instead of releasing the raw rows.
Replace the customer identifier with a new random integer generated separately for each browsing session.
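Applying a cryptographically secure one-way hash with a secret salt to the customer identifier is deterministic: every occurrence of the same identifier produces the same pseudonym, so the research partner can link events across sessions. The secret salt matters because customer identifiers come from a small, guessable space; without it, an attacker with auxiliary information could hash candidate identifiers and match the digests. Because the salt is never shared, that dictionary attack is unavailable and the transformation is computationally infeasible to reverse, which removes the need for a token vault while preserving full event-level utility.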
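The deterministic, keyed transformation described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: it assumes HMAC-SHA256 as the way to combine the secret salt with the identifier (the explanation only says "salted hash"), and the names `make_secret_salt`, `pseudonymize`, and the sample events are hypothetical.

```python
import hashlib
import hmac
import secrets


def make_secret_salt(num_bytes: int = 32) -> bytes:
    """Generate a random secret that stays inside the company (assumed 32 bytes)."""
    return secrets.token_bytes(num_bytes)


def pseudonymize(customer_id: str, secret_salt: bytes) -> str:
    """Return a deterministic pseudonym for customer_id.

    Assumption: HMAC-SHA256 is used as the keyed one-way function.
    The same (customer_id, secret_salt) pair always yields the same digest.
    """
    return hmac.new(secret_salt, customer_id.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical sample events; the real dataset would have 100M+ rows.
events = [
    {"customer_id": "C-1001", "session": "s1", "page": "/home"},
    {"customer_id": "C-1001", "session": "s2", "page": "/cart"},
    {"customer_id": "C-2002", "session": "s3", "page": "/home"},
]

salt = make_secret_salt()

# Only the pseudonym is shared; the raw identifier and the salt are not.
shared = [
    {
        "customer_pseudonym": pseudonymize(e["customer_id"], salt),
        "session": e["session"],
        "page": e["page"],
    }
    for e in events
]
```

In this sketch the first two rows carry the same pseudonym even though they come from different sessions, while the third row carries a different one. Every original event row is still present, which is what preserves event-level utility.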
Tokenization with format-preserving encryption is reversible by design: anyone holding the key or mapping table can recover the original identifier, so the organisation would have to maintain the secure vault that the scenario explicitly rules out. An interactive query service that adds differential-privacy noise protects identities but returns only aggregates, preventing the partner from building sequence-based models that rely on row-level linkage. Replacing the identifier with a random value generated per session destroys cross-session linkability, making the data unusable for the required analysis.
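To make the last point concrete, the following sketch shows why a random value per session cannot support cross-session joins. It is only an illustration under assumed names: `per_session_id`, `session_ids`, and the sample events are hypothetical, and a 64-bit random integer stands in for "a new random integer".

```python
import secrets
from collections import defaultdict

# Hypothetical sample events: customer C-1001 appears in two sessions.
events = [
    {"customer_id": "C-1001", "session": "s1", "page": "/home"},
    {"customer_id": "C-1001", "session": "s2", "page": "/cart"},
    {"customer_id": "C-2002", "session": "s3", "page": "/home"},
]

session_ids: dict[str, int] = {}


def per_session_id(session: str) -> int:
    """Return a random integer drawn once per browsing session."""
    if session not in session_ids:
        session_ids[session] = secrets.randbits(64)
    return session_ids[session]


# Group the released rows by the only key the partner would receive.
groups: dict[int, list[str]] = defaultdict(list)
for e in events:
    groups[per_session_id(e["session"])].append(e["page"])

# Each session ends up in its own group, so the two sessions of C-1001
# cannot be stitched back into one customer history.
sessions_per_group = [len(pages) for pages in groups.values()]
```

Here three sessions produce three unrelated keys, so nothing in the released data ties the `/home` and `/cart` events of the same customer together.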