d-Matrix Corsair: Transformative Generative Inference from Unsustainable to Attainable

UCFCompArch
Speaker: Gaurav Jain is a software engineer at d-Matrix, where he leads the effort to build the kernel software stack for next-generation in-memory-compute LLM inference hardware. He is also involved in research and exploration of techniques for improving end-to-end model performance and reducing memory overhead. Before d-Matrix he was a silicon architect on the Google Pixel TPU machine-learning accelerator team, where his day-to-day work involved architecture specification, performance modeling, and workload characterization for Google's machine-learning workloads. Gaurav holds a master's degree in electrical and computer engineering from the University of Wisconsin-Madison. His research interests span multiple domains, including model optimization, ML systems, hardware-software codesign, and computer architecture.

Abstract: In the rapidly changing world of large language models (LLMs), a recurring theme is the affordability of running inference on them. Less than two years after OpenAI first released ChatGPT, enterprises and academia have gone full throttle on research into improving the overall efficiency and affordability of deploying LLMs for data-center inference. While these techniques have delivered significant improvements on existing hardware, the distinct challenges posed by LLMs, including their low data reuse and high memory-bandwidth requirements, call for a new and radical approach. To that end, d-Matrix is addressing these challenges by designing a first-of-its-kind, datacenter-scale, chiplet-based in-memory computing platform and a corresponding software stack that make serving LLMs affordable and efficient. In this talk, Gaurav will cover the dataflow computing paradigm that lets them address the memory-bandwidth boundedness of LLM inference, how their SRAM-based in-memory compute differs from previous solutions, and how they leverage the PyTorch and MLIR stack to achieve 3x to 20x improvements in inference latency for state-of-the-art LLMs.
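The memory-bandwidth boundedness the abstract refers to follows from a simple arithmetic-intensity argument. The sketch below is a rough, hedged illustration with hypothetical layer sizes (not d-Matrix or model-specific figures): a single-token decode step is a GEMV that performs roughly one FLOP per byte of weights read, while a batched prefill matmul reuses each weight across many tokens and stays compute-bound.

    # Back-of-the-envelope arithmetic intensity for LLM inference phases.
    # All sizes here are assumed placeholders for illustration only.

    def arithmetic_intensity(m, k, n, bytes_per_elem=2):
        """FLOPs per byte moved for an (m x k) @ (k x n) matmul, fp16 operands."""
        flops = 2 * m * k * n                              # multiply-accumulate count
        bytes_moved = (m * k + k * n + m * n) * bytes_per_elem
        return flops / bytes_moved

    d_model = 8192  # assumed hidden size

    # Prefill: a 2048-token prompt hits the weight matrix in one batched matmul,
    # so each weight byte is reused across many tokens -> high intensity, compute-bound.
    print("prefill:", arithmetic_intensity(2048, d_model, d_model))   # ~1365 FLOPs/byte

    # Decode: one token per step reduces to a GEMV; every weight byte is read for a
    # single multiply-accumulate -> ~1 FLOP/byte, well below the roofline knee of
    # typical accelerators, hence bandwidth-bound.
    print("decode :", arithmetic_intensity(1, d_model, d_model))      # ~1 FLOP/byte

Keeping the weights resident in on-chip SRAM, as the in-memory-compute approach described in the talk does, attacks exactly this low-reuse, bandwidth-bound decode regime.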
Published 4 weeks ago, on 1403/06/01.