Mixture of Experts (MoE) + Switch Transformers: Build MASSIVE LLMs with CONSTANT Complexity!

Quick Tutorials
🚀In this video, we present a quick tutorial on Switch Transformers, which let you scale any transformer-based deep learning model, such as Large Language Models (LLMs), to trillions of parameters with constant computational complexity within a Mixture of Experts (MoE) framework, at both training and inference time. The tutorial is a visual guide to the original Transformer, the self-attention mechanism, the multi-head self-attention mechanism, and Switch Transformers.
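
🚀Since the tutorial walks through the self-attention and multi-head self-attention mechanisms, here is a minimal PyTorch sketch of scaled dot-product multi-head self-attention. It is only an illustration under simple assumptions (no masking, dropout, or padding handling), and the class name and default dimensions are ours, not code from the video or the paper:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    # Illustrative defaults: d_model and n_heads are assumptions, not the paper's settings.
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # One projection each for queries, keys, and values, plus an output projection.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, t, _ = x.shape
        # Split the model dimension into n_heads independent heads.
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        # Scaled dot-product attention: softmax(Q K^T / sqrt(d_head)) V
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        weights = F.softmax(scores, dim=-1)
        out = weights @ v                              # (b, n_heads, t, d_head)
        out = out.transpose(1, 2).reshape(b, t, -1)    # concatenate the heads
        return self.out_proj(out)

x = torch.randn(2, 16, 512)                  # (batch, seq_len, d_model)
print(MultiHeadSelfAttention()(x).shape)     # torch.Size([2, 16, 512])

Each head attends over the sequence independently in a d_model/n_heads-dimensional subspace, and the head outputs are concatenated and projected back to d_model.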

🚀The original paper for Switch Transformers is this:

W. Fedus et al., "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity," 2022

🚀As described in the paper, conventional deep learning models reuse the same parameters for every input, whereas Mixture of Experts (MoE) models instead select different parameters for each input. The result is a sparsely activated model with a very large number of parameters but a constant computational cost per input. Despite MoE's successes, complexity, communication costs, and training instability have hindered widespread adoption. The Switch Transformer addresses these issues by simplifying the MoE routing algorithm, designing models with reduced communication and computational costs, and introducing training techniques that mitigate instabilities. The paper also shows that large sparse models can be trained in lower-precision (bfloat16) formats. The proposed models, based on T5-Base and T5-Large, achieve up to 7x speedups in pre-training with the same computational resources, and the benefits extend to multilingual settings with improvements over mT5-Base across all 101 languages.
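
🚀To make the "constant complexity" idea concrete, below is a minimal sketch of the Switch (top-1) routing idea in PyTorch: a router picks exactly one expert feed-forward network per token, so adding experts adds parameters without adding per-token compute. The class name SwitchFFN, the dimensions, and the simple loop over experts are our own illustrative assumptions; the paper's implementation additionally uses a load-balancing auxiliary loss and expert capacity limits, which are omitted here:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchFFN(nn.Module):
    # Illustrative sketch of top-1 (switch) routing over expert feed-forward networks.
    def __init__(self, d_model: int = 512, d_ff: int = 2048, n_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # produces routing logits per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> flatten to individual tokens
        b, t, d = x.shape
        tokens = x.reshape(-1, d)
        probs = F.softmax(self.router(tokens), dim=-1)   # routing probabilities
        gate, expert_idx = probs.max(dim=-1)             # top-1 expert per token
        out = torch.zeros_like(tokens)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                # Only the selected expert runs for these tokens; scaling by the
                # gate value keeps the router differentiable.
                out[mask] = gate[mask].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape(b, t, d)

x = torch.randn(2, 16, 512)
print(SwitchFFN()(x).shape)    # torch.Size([2, 16, 512])

Because each token's expert output is scaled by the router probability, gradients still flow to the router even though only one expert runs per token, which is what keeps the compute per token constant as the number of experts grows.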

⭐️HashTags ⭐️
#nlp #sparsity #transformers #largelanguagemodels #ai #gpt #gpt4 #chatgpt #switch #llama #deeplearning #complexity #computerscience #datascience #attention #moe