New Discovery: LLMs have a Performance Phase

code_your_own_AI
13.9K views · 3 weeks ago
Grokking is a new phase in the performance of LLMs. Starting with arithmetic operations, we analyze the patterns in the embedding space of Transformers.

Grokking refers to a phenomenon where, after extensive training beyond typical saturation points, transformers can generalize effectively to unseen data, achieving high performance long after initial overfitting occurs. This discovery challenges conventional wisdom about early stopping to prevent overfitting, revealing that extended training can lead to superior generalization. The video highlights various studies demonstrating this effect, including an MIT study that observed geometric structures forming within the embedding space of a simple transformer model during prolonged training. These structures, such as circles and parallelograms, indicate that the model has internalized the underlying mathematical rules of tasks like modular arithmetic, leading to precise generalization.
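The circular structure described above can be sketched in a few lines. Assuming (as the MIT study observed for modular arithmetic) that a grokked embedding places each residue a on the unit circle at angle 2πka/p, modular addition reduces to composing rotations. The single frequency k = 1 here is an illustrative simplification; trained models typically learn several frequencies.

```python
import numpy as np

p = 97   # modulus used in the grokking experiments
k = 1    # illustrative frequency; grokked models learn several

def embed(a):
    """Place residue a on the unit circle -- the geometry a grokked
    embedding matrix approximately exhibits for modular arithmetic."""
    theta = 2 * np.pi * k * a / p
    return np.array([np.cos(theta), np.sin(theta)])

def rotate_add(a, b):
    """Modular addition as rotation composition in embedding space."""
    (xa, ya), (xb, yb) = embed(a), embed(b)
    # Composing two rotations adds their angles (complex multiplication).
    x, y = xa * xb - ya * yb, xa * yb + ya * xb
    angle = float(np.arctan2(y, x))
    return int(round(angle * p / (2 * np.pi * k))) % p

print(rotate_add(40, 70), (40 + 70) % p)  # both 13
```

Because addition becomes a geometric operation the model has literally built into its weights, generalization to unseen (a, b) pairs is exact rather than interpolated.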

Moreover, the video underscores the implications of grokking for complex reasoning tasks, where grokked transformers exhibit remarkable accuracy without the need for retrieval-augmented generation (RAG) or complex prompting strategies. This capability is especially significant for applications requiring advanced reasoning, as it simplifies the preparation and structuring of training datasets. The video illustrates that grokking involves the formation of structured representations within the model's embedding matrix, suggesting a deep connection between prolonged training, geometric embedding structures, and effective generalization. The practical impact of this discovery is profound, potentially transforming approaches to training AI systems for tasks that demand high levels of reasoning and generalization, and paving the way for more robust and capable AI applications.
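To illustrate how simple these training datasets can be, the following sketch (not from the video) enumerates the full modular-addition task and splits it into train and test halves, the setup used in the original grokking experiments; the 50% training fraction is an assumption matching typical reported configurations.

```python
import itertools
import random

def modular_addition_dataset(p=97, train_frac=0.5, seed=0):
    """Enumerate every pair (a, b) with label (a + b) mod p, then split.
    Grokking runs train a small transformer on the train half far past
    memorization, until test accuracy eventually jumps to near 100%."""
    data = [((a, b), (a + b) % p)
            for a, b in itertools.product(range(p), repeat=2)]
    rng = random.Random(seed)
    rng.shuffle(data)
    cut = int(train_frac * len(data))
    return data[:cut], data[cut:]

train, test = modular_addition_dataset()
print(len(train), len(test))  # 4704 4705
```

The entire task fits in memory, which is what makes grokking experiments cheap to reproduce: the difficulty lies not in the data but in training long enough past the overfitting point.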

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
https://arxiv.org/pdf/2201.02177

Towards Understanding Grokking:
An Effective Theory of Representation Learning
https://arxiv.org/pdf/2205.10343

The Slingshot Effect: A Late-Stage Optimization Anomaly in Adaptive Gradient Methods
https://openreview.net/forum?id=OZbn8...

#airesearch
#airevolution
Published on 1403/03/15 (Solar Hijri) · 13,912 views