A financial services company needs to generate a high-fidelity synthetic dataset for stress-testing its fraud detection models. The original dataset contains millions of transactions, with complex, non-linear correlations between features that must be preserved. The primary goal is to create new, unseen data that follows the same statistical distribution as the real data, rather than just re-balancing classes or masking values. Which of the following describes the most suitable creation process for this requirement?
Modify the original dataset by adding a small amount of random statistical noise to numerical columns and systematically swapping the values between different records in categorical columns. This process is repeated until a new dataset of the desired size is created.
Identify instances of the minority class within the dataset. For each instance, find its k-nearest neighbors and create new synthetic samples by interpolating between the instance and its neighbors along a randomly chosen vector.
Train two neural networks in an adversarial process. One network, the generator, attempts to create realistic data from a random input, while a second network, the discriminator, is trained to differentiate between the real data and the generator's output. The system is trained until the generator can produce data that consistently fools the discriminator.
Implement an encoder-decoder model where the encoder maps input data to a probabilistic latent space. New data is then created by taking samples from this latent space and passing them through the decoder to generate data points that follow the learned distribution.
The correct option describes the creation process of a Generative Adversarial Network (GAN). A GAN consists of two competing neural networks: a generator and a discriminator. The generator creates synthetic data from random noise, and the discriminator tries to distinguish between real and synthetic data. This adversarial process forces the generator to learn the underlying distribution of the original data, making it ideal for creating high-fidelity synthetic data that preserves complex correlations, as required by the scenario.
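This adversarial loop can be sketched in a minimal, dependency-free form. The toy below is illustrative only: the "dataset" is 1-D Gaussian transaction amounts, the generator is a single affine map, the discriminator is logistic regression, and the gradients are derived by hand; a real GAN would use deep networks and an autodiff framework.

```python
import numpy as np

# Toy 1-D GAN: real data ~ N(4, 1.25). Generator G(z) = a*z + b maps noise to
# data space; discriminator D(x) = sigmoid(w*x + c) scores realness.
rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

a, b = 1.0, 0.0          # generator parameters
w, c = 0.1, 0.0          # discriminator parameters
lr, batch = 0.01, 64

for step in range(2000):
    x_real = rng.normal(4.0, 1.25, batch)
    z = rng.normal(0.0, 1.0, batch)
    x_fake = a * z + b

    # Discriminator ascent on log D(real) + log(1 - D(fake))
    d_real, d_fake = sigmoid(w * x_real + c), sigmoid(w * x_fake + c)
    w += lr * (np.mean((1 - d_real) * x_real) - np.mean(d_fake * x_fake))
    c += lr * (np.mean(1 - d_real) - np.mean(d_fake))

    # Generator ascent on the non-saturating objective log D(fake)
    d_fake = sigmoid(w * x_fake + c)
    upstream = (1 - d_fake) * w          # d log D / d x_fake
    a += lr * np.mean(upstream * z)      # d x_fake / d a = z
    b += lr * np.mean(upstream)          # d x_fake / d b = 1

samples = a * rng.normal(0.0, 1.0, 1000) + b
print(f"generated mean={samples.mean():.2f}, std={samples.std():.2f}")
```

Note that neither network ever sees a reconstruction target: the generator improves only because fooling the discriminator requires matching the real distribution.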
The option describing selecting minority instances and generating samples along line segments refers to the Synthetic Minority Over-sampling Technique (SMOTE). While useful for handling class imbalance, SMOTE is less effective at capturing the complex, multivariate distributions of an entire dataset compared to deep learning models like GANs.
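A minimal numpy sketch of SMOTE's interpolation step (illustrative only, not the imbalanced-learn implementation): for each new sample, pick a minority point, find its k nearest minority neighbors, and interpolate at a random position along the connecting segment.

```python
import numpy as np

def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between each chosen
    minority point and one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        p = minority[i]
        # Distances to all minority points; take k nearest, excluding self.
        d = np.linalg.norm(minority - p, axis=1)
        neighbors = np.argsort(d)[1:k + 1]
        q = minority[rng.choice(neighbors)]
        lam = rng.random()                 # random point on the segment p..q
        synthetic.append(p + lam * (q - p))
    return np.array(synthetic)

minority = np.random.default_rng(1).normal(size=(20, 2))
new_points = smote(minority, n_new=10)
```

Because every synthetic point lies on a segment between two existing minority points, SMOTE can only fill in the convex neighborhood of the minority class; it cannot model the full joint distribution of all features across the whole dataset.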
The option involving adding random noise and swapping values describes data perturbation and swapping. These are primarily data anonymization or masking techniques that can degrade the complex correlations between variables, failing the key requirement of the scenario.
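A short sketch of why perturbation and swapping fail the requirement (column contents are hypothetical): noise is added to the numeric column and the categorical column is permuted across records, so each column's marginal distribution is roughly preserved while the cross-column relationships are destroyed.

```python
import numpy as np

rng = np.random.default_rng(0)
amounts = np.array([120.0, 35.5, 980.0, 12.25, 455.0])          # numeric column
merchants = np.array(["food", "fuel", "tech", "food", "fuel"])  # categorical column

# Perturbation: add small Gaussian noise scaled to the column's spread.
perturbed = amounts + rng.normal(0.0, amounts.std() * 0.05, amounts.shape)

# Swapping: shuffle categorical values between records. The value counts
# survive, but any link between merchant and amount is broken.
swapped = rng.permutation(merchants)
```

This is why these techniques belong to anonymization and masking: they hide individual records rather than learn the joint distribution.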
The option describing encoding data to a probabilistic latent space and then sampling from that learned distribution refers to a Variational Autoencoder (VAE). VAEs are also powerful generative models, but the correct answer specifically describes the distinct adversarial training process of a GAN, in which two networks compete rather than jointly optimizing a reconstruction-based objective.
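The VAE generation path can be sketched with untrained random weights to show the mechanics only: the encoder produces a mean and log-variance for each input, a latent sample is drawn via the reparameterization trick, and the decoder maps it back to data space. Layer sizes and names are illustrative; a real VAE would learn these weights by maximizing the evidence lower bound.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_latent = 8, 2
# Random linear "layers" standing in for trained encoder/decoder networks.
W_mu = rng.normal(size=(d_in, d_latent))
W_lv = rng.normal(size=(d_in, d_latent))
W_dec = rng.normal(size=(d_latent, d_in))

def encode(x):
    # Encoder outputs the mean and log-variance of q(z | x).
    return x @ W_mu, x @ W_lv

def sample_latent(mu, log_var):
    # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def decode(z):
    # Decoder maps latent samples back to data space.
    return z @ W_dec

x = rng.normal(size=(5, d_in))          # a batch of "real" records
mu, log_var = encode(x)
x_new = decode(sample_latent(mu, log_var))
```

The contrast with the GAN is the training signal: a VAE optimizes reconstruction plus a KL regularizer on the latent space, while a GAN relies on an adversary's judgment.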