A machine learning engineer is upgrading a natural language processing pipeline that uses an RNN-based architecture for machine translation. The existing model struggles with long-term dependencies in lengthy sentences and faces slow training times due to its sequential nature. To address these issues, the engineer decides to implement a Transformer model. Which core component of the Transformer architecture directly addresses both the challenge of capturing long-range dependencies and the bottleneck of sequential processing?
The correct answer is the self-attention mechanism, the fundamental innovation of the Transformer architecture. When encoding any given token, self-attention weighs the importance of every other token in the input sequence, creating a direct path between any two positions regardless of how far apart they are; this is what lets the model capture long-range dependencies. Because the attention scores for all tokens are computed simultaneously as matrix operations, the whole sequence is processed in parallel, removing the sequential bottleneck of RNNs, where information and gradients must pass through many intermediate steps, leading to slow training and the vanishing gradient problem.
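To make this concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. The shapes and weight names (Wq, Wk, Wv) are illustrative assumptions rather than any particular library's API; a real Transformer uses multiple heads and learned projections.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a whole sequence at once.

    X:          (seq_len, d_model) token representations
    Wq, Wk, Wv: (d_model, d_k) projection matrices (toy, single head)
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    # Every token attends to every other token in one matrix product,
    # so the path between any two positions has length 1.
    scores = Q @ K.T / np.sqrt(d_k)                    # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # (seq_len, d_k)

# Toy usage: 5 tokens, model width 8, head width 4
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (5, 4)
```

Note that nothing in this computation depends on processing tokens one after another, which is exactly why training can be parallelized across the sequence.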
Positional encodings are incorrect because their purpose is to inject information about the order of tokens in the sequence, which self-attention by itself does not capture. They are necessary for the model to understand word order, but they neither model dependencies between tokens nor enable parallel processing.
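For reference, a minimal sketch of the fixed sinusoidal encodings described in the original Transformer paper; the function name and the assumption of an even model width are illustrative only.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sinusoidal encodings (assumes d_model is even):
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The encodings are simply added to the token embeddings
# so that attention scores become order-aware.
print(sinusoidal_positional_encoding(seq_len=5, d_model=8).shape)  # (5, 8)
```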
The encoder-decoder stack is a high-level structure that also exists in many RNN-based models for sequence-to-sequence tasks. Therefore, it is not the unique component in Transformers that solves the stated problems. The innovation lies in how the encoder and decoder are constructed (i.e., with self-attention layers), not the stack itself.
Residual connections and layer normalization are general deep learning techniques used to stabilize training and improve gradient flow in deep networks, including Transformers. They mitigate vanishing gradients that arise from network depth, but they address neither the long-range dependency problem nor the sequential-processing bottleneck that RNNs suffer from.
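A minimal sketch of how these pieces are typically combined, assuming the post-norm wrapping LayerNorm(x + Sublayer(x)) from the original paper; the helper names are hypothetical and the layer norm omits the learnable scale and shift for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's features to zero mean and unit variance
    (simplified: no learnable gain or bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_block(x, sublayer):
    """Post-norm residual wrapper: LayerNorm(x + Sublayer(x)).

    The skip connection gives gradients a direct path around the
    sublayer, which is what stabilizes very deep stacks."""
    return layer_norm(x + sublayer(x))

# Toy usage with a stand-in sublayer
x = np.random.default_rng(1).normal(size=(5, 8))
print(residual_block(x, lambda h: 0.1 * h).shape)  # (5, 8)
```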