{"id":21076,"date":"2025-12-18T19:30:41","date_gmt":"2025-12-18T19:30:41","guid":{"rendered":"https:\/\/scannn.com\/the-next-generation-of-encoder-decoder-models\/"},"modified":"2025-12-18T19:30:41","modified_gmt":"2025-12-18T19:30:41","slug":"the-next-generation-of-encoder-decoder-models","status":"publish","type":"post","link":"https:\/\/scannn.com\/lv\/the-next-generation-of-encoder-decoder-models\/","title":{"rendered":"The next generation of encoder-decoder models"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<div>\n<p data-block-key=\"7j9y9\">T5Gemma 2 is the next evolution of our encoder-decoder family based on Gemma 3, featuring the first multi-modal and long-context encoder-decoder models.<\/p>\n<p data-block-key=\"d5qcc\">Unlike T5Gemma, T5Gemma 2 adopts tied word embeddings (over encoder and decoder) and merged decoder self- and cross-attention to save model parameters. It offers compact pre-trained models at sizes of 270M-270M (~370M total, excluding vision encoder), 1B-1B (~1.7B) and 4B-4B (~7B) parameters, making them ideal for rapid experimentation and deployment in on-device applications.<\/p>\n<h2 data-block-key=\"amrfl\">Background<\/h2>\n<p data-block-key=\"f86n5\">With the original T5Gemma, we demonstrated that we could successfully adapt modern, pre-trained decoder-only models into an encoder-decoder architecture, unlocking new versatility. By initializing with weights from a powerful decoder-only model and then applying continued pre-training, we created high-quality, inference-efficient models while bypassing the computational cost of training from scratch.<\/p>\n<p data-block-key=\"fljar\">T5Gemma 2 extends this into the realm of vision-language models by incorporating key innovations from Gemma 3.<\/p>\n<h2 data-block-key=\"dlcqj\">What\u2019s new<\/h2>\n<p data-block-key=\"3mg06\">T5Gemma 2 is more than a re-training. It incorporates significant architectural changes while inheriting many of the powerful, next-generation features of the Gemma 3 family.<\/p>\n<h3 data-block-key=\"2sigb\">Architectural innovations for efficiency<\/h3>\n<p data-block-key=\"djbnv\">To maximize efficiency at smaller scales, we have introduced key structural refinements:<\/p>\n<ul>\n<li data-block-key=\"8jb3m\"><b>Tied embeddings:<\/b> We now tie the embeddings between the encoder and decoder. This significantly reduces the overall parameter count, allowing us to pack more active capabilities into the same memory footprint \u2014 crucial for our new compact 270M-270M model.<\/li>\n<li data-block-key=\"81pdv\"><b>Merged attention:<\/b> In the decoder, we adopt a merged attention mechanism, combining self- and cross-attention into a single, unified attention layer. This reduces model parameters and architectural complexity, improving model parallelization and benefiting inference.<\/li>\n<\/ul>\n<h3 data-block-key=\"c6ibb\">Next-generation capabilities<\/h3>\n<p data-block-key=\"2ls8k\">Drawing from Gemma 3, T5Gemma 2 also represents a significant upgrade in model capabilities:<\/p>\n<ul>\n<li data-block-key=\"cln6a\"><b>Multimodality:<\/b> T5Gemma 2 models can understand and process images alongside text. By utilizing a highly efficient vision encoder, the models can seamlessly perform visual question answering and multimodal reasoning tasks.<\/li>\n<li data-block-key=\"7b1hr\"><b>Extended long context:<\/b> We&#8217;ve dramatically expanded the context window. 
Leveraging Gemma 3&#8217;s alternating local and global attention mechanism, T5Gemma 2 can handle context windows of up to 128K tokens.<\/li>\n<li data-block-key=\"f1t8a\"><b>Massively multilingual:<\/b> Trained on a larger, more diverse dataset, these models now support over 140 languages out of the box.<\/li>\n<\/ul>\n<h2 data-block-key=\"6i3nf\">Performance<\/h2>\n<p data-block-key=\"efjl4\">T5Gemma 2 sets a new standard for what compact encoder-decoder models can achieve. Our new models demonstrate strong performance across key capability areas, inheriting the powerful multimodal and long-context features from the Gemma 3 architecture.<\/p>\n<\/div>\n<p><br \/>\n<br \/><a href=\"https:\/\/blog.google\/technology\/developers\/t5gemma-2\/\">Source link <\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>T5Gemma 2 is the next evolution of our encoder-decoder family based on Gemma 3, featuring the first multi-modal and long-context encoder-decoder models. Unlike T5Gemma, T5Gemma 2 adopts tied word embeddings (over encoder and decoder) and merged decoder self- and cross-attention to save model parameters. It offers compact pre-trained models at sizes of 270M-270M (~370M total, [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":21077,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[100],"tags":[],"class_list":["post-21076","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-google"],"_links":{"self":[{"href":"https:\/\/scannn.com\/lv\/wp-json\/wp\/v2\/posts\/21076","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/scannn.com\/lv\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/scannn.com\/lv\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/scannn.com\/lv\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/scannn.com\/lv\/wp-json\/wp\/v2\/comments?post=21076"}],"version-history":[{"count":0,"href":"https:\/\/scannn.com\/lv\/wp-json\/wp\/v2\/posts\/21076\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/scannn.com\/lv\/wp-json\/wp\/v2\/media\/21077"}],"wp:attachment":[{"href":"https:\/\/scannn.com\/lv\/wp-json\/wp\/v2\/media?parent=21076"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/scannn.com\/lv\/wp-json\/wp\/v2\/categories?post=21076"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/scannn.com\/lv\/wp-json\/wp\/v2\/tags?post=21076"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}