Encoder vs Decoder Architecture

In depth

An encoder-decoder architecture splits a sequence-to-sequence task, such as machine translation, into two stages: an encoder that reads the input sequence and a decoder that writes the output sequence. Understanding the distinct role of each stage is key to grasping how these models function.

The Encoder: Understanding Context

The encoder's primary role is to process the entire input sequence at once, so that every token is interpreted in the context of all the others. For instance, in translating "Je t'aime," the encoder analyzes "Je," "t'," and "aime" together, and each token can attend to the tokens both before and after it. The result is a dense numerical representation, referred to here as a "context map": one contextual vector per input token rather than a single compressed summary. Each vector encodes what its token means within this particular sentence, and the decoder reads from this map as it generates output.
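As a rough illustration of "every token sees every other token," here is a minimal sketch of unmasked self-attention in plain Python. The 2-dimensional embeddings are made-up values, and the sketch assumes queries, keys, and values are simply the input vectors themselves; a real encoder learns separate projections for each and stacks several such layers.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(vectors):
    """Unmasked self-attention: each position attends to ALL positions.

    Simplifying assumption: queries = keys = values = the inputs.
    """
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        weights = softmax([dot(query, key) / math.sqrt(dim) for key in vectors])
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        outputs.append(mixed)
    return outputs

# Hypothetical toy embeddings for the three input tokens.
tokens = ["Je", "t'", "aime"]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# One contextual vector per input token: the "context map".
context = self_attention(embeddings)
for token, vector in zip(tokens, context):
    print(token, [round(x, 3) for x in vector])
```

The output has the same length as the input, which is why the context map is better pictured as a per-token table than as a single summary vector.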

The Decoder: Generating Output Step-by-Step

In contrast, the decoder generates the output sequence one token at a time. After receiving the encoder's context map, the decoder starts producing the output, for example, generating "I," then "love," and finally "you," with each new token fed back in as input for the next step. A crucial aspect of the decoder's operation is masked (causal) self-attention. During training the whole target sentence is presented at once, so without a mask the decoder could simply copy the next word. The mask prevents each position from "looking ahead" at later tokens in the output sequence, ensuring that every token is predicted from only the preceding output tokens and the encoder's context.
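The masking described above can be sketched in a few lines of plain Python. This is an illustrative toy, not a full decoder: the attention scores are all set to zero (an assumption made so that the effect of the mask is easy to read off), and the function names are hypothetical.

```python
import math

def causal_mask(length):
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [[j <= i for j in range(length)] for i in range(length)]

def masked_softmax(scores, allowed):
    """Softmax in which disallowed positions receive exactly zero weight."""
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    peak = max(masked)  # finite, because a position may always attend to itself
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

output_tokens = ["I", "love", "you"]
mask = causal_mask(len(output_tokens))

# Assumed uniform raw scores, so the weights reflect only the mask.
uniform_scores = [0.0] * len(output_tokens)
attention_weights = [masked_softmax(uniform_scores, row) for row in mask]

for token, row in zip(output_tokens, attention_weights):
    print(token, [round(w, 3) for w in row])
# "I"    -> [1.0, 0.0, 0.0]   sees only itself
# "love" -> [0.5, 0.5, 0.0]   sees "I" and itself, never "you"
```

Setting a score to negative infinity before the softmax is the usual way to express "not allowed," since it yields a weight of exactly zero.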

Cross-Attention: Bridging the Gap

The connection between the encoder and decoder is established through a mechanism called cross-attention. At each step, the decoder's current state acts as a query, and the vectors in the encoder's context map act as the keys and values it is matched against. The resulting attention weights let the decoder focus on the parts of the input most relevant to the token it is about to produce. For instance, when the decoder needs to generate "love," cross-attention helps it place most of its weight on the corresponding part of the input, "aime," within the encoder's context map.
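A minimal sketch of a single cross-attention step follows. All vectors are hand-picked toy values chosen so that the decoder's query for "love" lines up with the encoder vector for "aime"; in a trained model these would come from learned projections, and the same encoder outputs are used here as both keys and values for simplicity.

```python
import math

def cross_attention(query, keys, values):
    """One decoder query attends over all encoder positions."""
    dim = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted mix of encoder values: what the decoder "reads" from the input.
    mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return weights, mixed

input_tokens = ["Je", "t'", "aime"]

# Hypothetical encoder outputs (the context map), one vector per input token.
encoder_outputs = [
    [1.0, 0.0, 0.0],  # "Je"
    [0.0, 1.0, 0.0],  # "t'"
    [0.0, 0.0, 1.0],  # "aime"
]

# Hypothetical decoder query at the step that should produce "love".
love_query = [0.1, 0.2, 3.0]

weights, read_vector = cross_attention(love_query, encoder_outputs, encoder_outputs)
focus = input_tokens[weights.index(max(weights))]

for token, weight in zip(input_tokens, weights):
    print(token, round(weight, 3))
print("strongest focus:", focus)
```

With these toy values the largest weight falls on "aime". The other input tokens still receive some weight, which reflects that attention is a soft blend rather than a hard lookup.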

Key Takeaways

Encoders process the entire input sequence simultaneously to create a context map.
Decoders generate the output sequence one element at a time.
Masked attention in decoders prevents looking ahead at future output tokens.
Cross-attention links the decoder to the encoder's context map, enabling focused information retrieval.
Encoders build a map of meaning, while decoders navigate it to produce the final output.
