You are training word embeddings for a morphologically rich language whose corpus contains many inflected forms that may not reappear at inference time. To address the out-of-vocabulary problem, you replace the classic skip-gram model, which treats each token as an indivisible symbol, with a variant that represents every word as the sum of its character n-gram vectors (for example, all 3- to 6-character substrings plus the whole word). Which concrete benefit does this n-gram-based representation provide over the original Word2Vec model?
It can synthesize embeddings for unseen words by composing their character n-gram vectors at inference time.
It makes negative sampling unnecessary during training because n-gram vectors inherently separate frequent and rare words.
It guarantees lower-dimensional embeddings because each n-gram acts as an orthogonal basis, allowing dimensions to be dropped without information loss.
It removes the need to specify a context window, since n-gram structure alone captures all contextual dependencies.
Because the model learns a vector for every character n-gram, it can build an embedding for a word it never saw in the training corpus by summing the vectors of its constituent n-grams, which directly mitigates the out-of-vocabulary (OOV) problem. The dimensionality of the embedding space is not automatically reduced by using n-grams, so no dimensionality savings are guaranteed. A context window is still needed during training to learn distributional semantics, so n-grams do not eliminate that hyperparameter. Finally, efficient training still relies on objectives such as negative sampling or hierarchical softmax; subword vectors do not obviate them.
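The toy sketch below (an illustrative assumption, not FastText's actual implementation) shows the composition step: extract a word's character n-grams, look up a vector for each, and sum them. The random stand-in vectors are placeholders for what a real model would learn with skip-gram training over the corpus.

```python
import numpy as np

EMBED_DIM = 100          # dimensionality of the (hypothetical) embedding space
MIN_N, MAX_N = 3, 6      # n-gram lengths, as described in the question


def char_ngrams(word, min_n=MIN_N, max_n=MAX_N):
    """Return the character n-grams of a word (with boundary markers)
    plus the whole padded word itself."""
    padded = "<" + word + ">"
    grams = [
        padded[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(padded) - n + 1)
    ]
    grams.append(padded)                 # the whole word is treated as one more unit
    return grams


# Stand-in for trained n-gram vectors; a real model would learn these
# with skip-gram plus negative sampling over the corpus.
rng = np.random.default_rng(0)
ngram_vectors = {}


def embed(word):
    """Compose a word vector by summing the vectors of its n-grams.
    This works even for a word never seen as a whole token, as long as
    its n-grams appeared in other (e.g. related inflected) words."""
    vec = np.zeros(EMBED_DIM)
    for g in char_ngrams(word):
        if g not in ngram_vectors:
            # Toy behaviour: invent a vector for unseen n-grams so the demo runs;
            # a trained model would simply skip n-grams it has no vector for.
            ngram_vectors[g] = rng.normal(scale=0.1, size=EMBED_DIM)
        vec += ngram_vectors[g]
    return vec


# An inflected form that never appeared as a whole token still gets an embedding,
# because it shares n-grams ("<wa", "alk", "ing>", ...) with forms that did appear.
print(embed("walking")[:5])
```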