XDC 2021 | Fast Checkpoint Restore for AMD GPUs with CRIU | Felix Kuehling & Rajneesh Bhardwaj, AMD

X.Org Foundation
X.org Developers Conference 2021 - September 15-17, 2021 - https://xdc2021.x.org/
Slides and materials: https://indico.freedesktop.org/event/...
Our work-in-progress code: https://github.com/RadeonOpenCompute/...
Further reading: https://github.com/RadeonOpenCompute/...


CRIU (Checkpoint/Restore in Userspace) is the de facto choice for checkpoint and restore. One of its major limitations is handling tasks that have device state associated with them: that state is managed by the driver, which CRIU cannot control. CRIU does, however, provide a flexible plugin mechanism to address this. So far there has been no serious plugin for a real device (at least in the public domain) that deals with a complex device such as a GPU. We would like to discuss our work to support CRIU with AMD ROCm, AMD's fully open source solution for the machine learning and HPC compute space. This work could potentially be extended to support video decode/encode using render nodes.


CRIU already has a plugin architecture to support processes that use device files. Using this architecture, we added a plugin that supports CRIU with GPU compute applications running on the AMD ROCm software stack. This requires new ioctls in the KFD kernel mode driver to save and restore hardware and kernel mode driver state, such as memory mappings, VRAM contents, user mode queues, and signals. We also needed a few new plugin hooks in CRIU itself: to remap device files and the mmap offsets within them, and to finalize GPU virtual memory mappings and resume GPU execution after all VMAs have been restored by the PIE code.
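The ordering described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: every name in it (`ToyPlugin`, `dump_device_file`, `restore_device_file`, `update_vma_offset`, `resume_late`) is hypothetical and chosen for illustration. It does not use the real CRIU plugin API or the KFD ioctls, and the "driver state" is a plain dictionary. The point it tries to show is that the device state is recreated first, the mmap offsets are remapped while VMAs are being restored, and the GPU is resumed only at the very end.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPlugin:
    """Toy stand-in for a device plugin. All hook names are hypothetical."""
    log: list = field(default_factory=list)
    offset_map: dict = field(default_factory=dict)

    def dump_device_file(self, device):
        # Checkpoint side: copy the (simulated) driver state into an image.
        self.log.append("dump_device_file")
        return {
            "queues": list(device["queues"]),
            "vram": bytes(device["vram"]),
            "mmap_offsets": list(device["mmap_offsets"]),
        }

    def restore_device_file(self, image):
        # Restore side: recreate device state. The simulated driver hands out
        # new mmap offsets, so remember the old -> new mapping (made-up values).
        self.log.append("restore_device_file")
        for i, old in enumerate(image["mmap_offsets"]):
            self.offset_map[old] = 0x100000 + 0x1000 * (i + 1)
        return {"queues": list(image["queues"]),
                "vram": image["vram"],
                "running": False}

    def update_vma_offset(self, old_offset):
        # Called per device VMA so it is mapped at the new offset.
        self.log.append("update_vma_offset")
        return self.offset_map.get(old_offset, old_offset)

    def resume_late(self, device):
        # Called only after every VMA is back in place.
        self.log.append("resume_late")
        device["running"] = True


def checkpoint(plugin, process):
    return {"device": plugin.dump_device_file(process["device"]),
            "vmas": list(process["vmas"])}


def restore(plugin, image):
    device = plugin.restore_device_file(image["device"])
    vmas = [plugin.update_vma_offset(off) for off in image["vmas"]]
    # In the flow described in the talk, the PIE code restores the VMAs here.
    plugin.resume_late(device)
    return {"device": device, "vmas": vmas}


process = {
    "device": {"queues": ["q0"], "vram": b"\x01\x02",
               "mmap_offsets": [0x2000, 0x3000]},
    "vmas": [0x2000, 0x3000],
}
plugin = ToyPlugin()
restored = restore(plugin, checkpoint(plugin, process))
```

In this toy run, `plugin.log` records the order in which the hypothetical hooks fired, and `restored["vmas"]` holds the remapped offsets rather than the ones saved at checkpoint time.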


The result is the first real-world plugin and the first example of GPU support in CRIU.


We are going to present the architecture of our plugin and how it interacts with CRIU and our GPU driver during the checkpoint and restore flow. We can also talk about some security considerations, initial test results, and performance statistics.
Published on 1400/06/31 (Solar Hijri calendar).