A data scientist is training a transformer-based model for a nuanced sentiment analysis task on a small, specialized corpus. The model demonstrates high accuracy on the training set but generalizes poorly to unseen data, indicating overfitting. The primary goal is to augment the dataset to improve model robustness by generating syntactically diverse but semantically consistent new samples. Which of the following augmentation techniques is most suitable for this scenario?
The correct answer is back-translation. This technique translates the text into an intermediate language and then back into the original language. The round trip typically yields sentences that are syntactically different but semantically very close to the source, which is exactly what is needed to generate high-quality, diverse training samples that combat overfitting. Context-unaware synonym replacement can easily alter the sentiment and meaning of the text, especially in nuanced cases; for example, replacing the slang "sick" (meaning "excellent") with "ill" reverses the sentiment entirely. Random word deletion and insertion are likely to produce grammatically incorrect or nonsensical sentences, introducing harmful noise rather than useful variation. Stop word removal is a text preprocessing or cleaning step, not a data augmentation technique; its purpose is to reduce dimensionality, not to create new training samples.
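As a concrete illustration, below is a minimal back-translation sketch in Python, assuming the Hugging Face transformers library (with sentencepiece and torch installed) and the publicly available Helsinki-NLP Opus-MT English-French models. The pivot language, decoding settings, and sample sentence are illustrative choices, not part of the question.

```python
# Minimal back-translation sketch using Hugging Face MarianMT models.
# Assumes: pip install transformers sentencepiece torch
from transformers import MarianMTModel, MarianTokenizer

def load(model_name):
    # Load a translation model and its matching tokenizer.
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    return tokenizer, model

def translate(texts, tokenizer, model):
    # Tokenize a batch, generate translations, decode back to strings.
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch, num_beams=4, max_length=256)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

# English -> French, then French -> English round trip.
en_fr_tok, en_fr = load("Helsinki-NLP/opus-mt-en-fr")
fr_en_tok, fr_en = load("Helsinki-NLP/opus-mt-fr-en")

originals = ["The film was not bad, but I expected far more from the director."]
pivot = translate(originals, en_fr_tok, en_fr)
augmented = translate(pivot, fr_en_tok, fr_en)

for src, aug in zip(originals, augmented):
    print("original: ", src)
    print("augmented:", aug)
```

A single pivot language is shown here for brevity; in practice, using several pivot languages or sampling-based decoding (rather than beam search) increases the syntactic diversity of the augmented samples.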