How to run Mixtral LLM on your Laptop - January 26, 2024 - Exciting AI Updates
593 views - 6 months ago
Presented by Denis Mazur & Artyom Eliseev
Slides - https://github.com/lselector/seminar/...
Run Mixtral on an Nvidia 3060 with 12 GB!
- https://arxiv.org/abs/2312.17238 - paper
- https://github.com/dvmazur/mixtral-of...
- Twitter: 1741103866047869222
Very elegant work.
The original Mixtral requires more than 90 GB of memory.
Almost 97% of that is taken up by the feed-forward (expert) networks in the transformer layers. The authors combined several techniques to reduce the memory footprint while preserving model accuracy.
The authors tested multiple quantization methods and settled on a flexible mixed scheme in which different parts of the network are quantized differently.
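The paper's actual quantization scheme is more involved than this; a minimal sketch of the general idea of mixed-precision quantization, using plain uniform quantization (all names here are illustrative, not the authors' code):

```python
import numpy as np

def quantize(w, bits):
    """Uniform symmetric quantization of a weight tensor to the given bit width."""
    levels = 2 ** (bits - 1) - 1           # e.g. 127 for 8-bit, 7 for 4-bit
    scale = np.abs(w).max() / levels
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Illustrative mixed scheme: keep attention weights at higher precision
# than the (much larger) expert FFN weights -- the parts that dominate
# memory get the coarser, more aggressive quantization.
rng = np.random.default_rng(0)
attn_w   = rng.standard_normal((4, 4)).astype(np.float32)
expert_w = rng.standard_normal((4, 4)).astype(np.float32)

q_attn, s_attn     = quantize(attn_w, bits=8)    # finer grid, smaller error
q_expert, s_expert = quantize(expert_w, bits=4)  # coarser grid, bigger savings

err_attn   = np.abs(dequantize(q_attn, s_attn) - attn_w).max()
err_expert = np.abs(dequantize(q_expert, s_expert) - expert_w).max()
```

The trade-off is visible directly: the 4-bit experts reconstruct with a larger error than the 8-bit attention weights, but shrink the dominant part of the model the most.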
To further reduce GPU memory requirements, the authors implemented dynamic loading/offloading of the expert networks in the transformer layers. They also used "speculative" loading: predicting which experts will be needed and loading only those parts of the feed-forward networks.
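The load/offload mechanics can be sketched as an LRU cache over expert weights, with a prefetch hook standing in for the speculative prediction. This is a toy illustration under assumed names, not the authors' implementation:

```python
from collections import OrderedDict

class ExpertCache:
    """Toy sketch of dynamic expert loading/offloading: keep only a few
    experts resident (standing in for limited GPU memory) and evict the
    least-recently-used one when a new expert must be loaded."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn            # loads expert weights from host RAM / disk
        self.resident = OrderedDict()     # expert_id -> weights, in LRU order
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)   # mark as recently used
        else:
            self.misses += 1
            if len(self.resident) >= self.capacity:
                self.resident.popitem(last=False)  # offload the LRU expert
            self.resident[expert_id] = self.load_fn(expert_id)
        return self.resident[expert_id]

    def prefetch(self, predicted_ids):
        """'Speculative' loading: fetch the experts a guessed router choice
        would need, hiding load latency behind earlier computation."""
        for eid in predicted_ids:
            self.get(eid)

# Usage: 8 experts total, room for only 3 on the "GPU"
cache = ExpertCache(capacity=3, load_fn=lambda eid: f"weights-{eid}")
cache.prefetch([0, 1])   # guess the next layer's experts ahead of time
cache.get(0)             # hit: already resident
cache.get(5)             # miss: loads expert 5 on demand
```

When the speculative prediction is right, the expert is already resident by the time the router actually selects it, so the load cost overlaps with useful work instead of stalling generation.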
As a result, they demonstrated that Mixtral can run with decent (practical) performance on a modest laptop with an Nvidia 3060 GPU and only 12 GB of VRAM.
Denis Mazur
- https://github.com/dvmazur
- https://huggingface.co/dvmazur
Artyom Eliseev
- https://github.com/lavawolfiee
- https://huggingface.co/lavawolfiee
My websites:
- Enterprise AI Solutions - https://EAIS.ai
- Linkedin - LinkedIn: levselector
- GitHub - https://github.com/lselector