Run Llama 2 with 32k Context Length!

Trelis Research
Achieve long context length by using Code Llama to scale to 32k tokens.
- Get up to 16k tokens on a Colab 40 GB GPU
- Get up to 32k tokens on an 80 GB A100 on RunPod (or AWS or Azure)
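
The GPU sizes above can be sanity-checked with a rough KV-cache estimate. This is a minimal sketch (my assumption, not from the video), using standard Llama 2 7B shapes of 32 layers, 32 KV heads and head dim 128 with an fp16 cache; quantised weights and activations come on top:

```python
# Rough KV-cache memory estimate (assumed Llama 2 7B shapes:
# 32 layers, 32 KV heads, head dim 128, fp16 cache = 2 bytes per value).
layers, kv_heads, head_dim, bytes_per_value = 32, 32, 128, 2

for seq_len in (16_384, 32_768):
    # Factor of 2 covers both the key cache and the value cache.
    kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value
    print(f"{seq_len:>6} tokens -> ~{kv_bytes / 1e9:.1f} GB KV cache")

# Prints ~8.6 GB at 16k and ~17.2 GB at 32k; adding quantised weights and
# activations lands near the 40 GB / 80 GB cards mentioned above.
```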

Tips:
- Use Flash Attention and BetterTransformer
- Use GPTQ quantization
- Use the 13B model for better quality
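
A minimal loading sketch along the lines of those tips (assumptions, not the exact notebook code: the TheBloke/CodeLlama-13B-Instruct-GPTQ repo, recent transformers/optimum/auto-gptq installs, and a working flash-attn build):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/CodeLlama-13B-Instruct-GPTQ"  # assumed GPTQ model repo

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",                        # place the GPTQ weights on the GPU
    attn_implementation="flash_attention_2",  # Flash Attention for long prompts
)
# Alternative to Flash Attention: model.to_bettertransformer() via optimum.

long_document = "..."  # placeholder: paste your long document here
prompt = f"[INST] Summarise the following document:\n{long_document} [/INST]"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200, temperature=0.7, do_sample=True)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```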

Free jupyter notebook: https://github.com/TrelisResearch/cod...

Purchase the PRO notebook: https://buy.stripe.com/fZe14Q5tP0zpaM...
- Allows for saving and re-loading of conversations
- Allows for uploading and analysis of documents
- Works on Google Colab or on a Server, e.g. AWS, Azure, RunPod (affiliate link: https://tinyurl.com/yjxbdc9w)

Trelis ADVANCED Inference Repo:
- Server Setup
- API setup with Runpod
- Function-calling API scripts
Learn more: https://trelis.com/enterprise-server-...

0:00 How to run Llama 2 with longer context length
0:50 Run Llama 2 with 16k context in Google Colab
2:20 How to run a GPTQ model in Colab
3:43 Run Llama 2 7B with 32k context length using RunPod
6:20 Run Llama 2 13B for better performance! 16k context length
8:15 Streaming Llama 2 13B on 16k context length
9:50 Adjusting max token output and temperature
10:20 Streaming Llama 2 13B on 16k context length and 0 temperature
11:25 STREAMING LLAMA 2 13B ON 32k CONTEXT LENGTH!
12:50 PRO NOTEBOOK - Save Chats and Files. Easily adjust context length.
16:40 THEORY BONUS: How to get longer context length?
17:45 How does GPTQ work?
18:00 How does Flash Attention work?
19:45 What is the best model for long context length?
20:20 Which is better: Llama 2, Code Llama, or YaRN?
21:30 Tips for long context lengths