True Multimodal RAG - Audio/Image/Video/Text

Adam Lucek
Everyone knows text-based vector databases and text-based RAG for LLM applications, but it turns out that's just the beginning! Using CLIP and CLAP models along with some fancy tricks, we embed 25,000 text entries, 1,999 pictures, 2,000 audio files, and 99 videos into a single vector database, allowing direct text-to-text, text-to-audio, text-to-image, and text-to-video retrieval!
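
To make the "single vector database" idea concrete, here is a minimal sketch of one store that holds entries from all four modalities and ranks them by cosine similarity. It is not the code from the video: the `Entry` and `ToyStore` names, the identifiers, and the three-dimensional vectors are all hypothetical placeholders. In a real pipeline the vectors would come from CLIP or CLAP encoders, and the sketch simply assumes that query and entry vectors are comparable.

```python
import math
from dataclasses import dataclass


@dataclass
class Entry:
    modality: str        # "text", "image", "audio", or "video"
    ref: str             # hypothetical identifier for the stored item
    vector: list[float]  # placeholder embedding (toy values, not model output)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyStore:
    """In-memory stand-in for a vector database (illustration only)."""

    def __init__(self):
        self.entries = []

    def add(self, entry):
        self.entries.append(entry)

    def query(self, vector, k=1, modality=None):
        # Optionally restrict the search to one modality, then rank by similarity.
        pool = [e for e in self.entries
                if modality is None or e.modality == modality]
        ranked = sorted(pool, key=lambda e: cosine(vector, e.vector), reverse=True)
        return ranked[:k]


store = ToyStore()
store.add(Entry("text", "doc-1", [1.0, 0.0, 0.0]))
store.add(Entry("image", "img-1", [0.9, 0.1, 0.0]))
store.add(Entry("audio", "clip-1", [0.0, 1.0, 0.0]))
store.add(Entry("video", "vid-1", [0.0, 0.0, 1.0]))

# Toy query vector; a real one would be produced by a text encoder.
query_vec = [0.0, 0.9, 0.1]
top = store.query(query_vec, k=1)
images_only = store.query(query_vec, k=1, modality="image")
```

Tagging each entry with its modality lets one query either search everything at once or filter down to a single type, which is the kind of lookup the full multimodal retrieval chapter demonstrates.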

Resources:
Multimodal Image RAG Video:
Code: https://github.com/ALucek/true-multim...
Colab Notebook: https://colab.research.google.com/dri...

Chapters:
00:00 - Intro
01:04 - CLIP Model Review
02:08 - CLAP Model Overview
02:35 - Modality 1: Audio Setup & Dataset
03:45 - Modality 1: Custom Audio Embedding & Loader Functions
05:40 - Modality 1: Audio Embedding & Testing Retrieval
07:38 - Modality 2: Image Setup & Dataset
08:52 - Modality 2: Image Embedding & Testing Retrieval
09:46 - Modality 3: Text Setup & Dataset
10:24 - Modality 3: Text Embedding
12:22 - Modality 3: Testing Text Retrieval
13:06 - Modality 4: Video Setup & Methodology
15:06 - Modality 4: Video Dataset & Embedding
16:22 - Modality 4: Testing Video Retrieval
17:10 - Full Multimodal Retrieval!
18:34 - RAG: Setup
19:26 - RAG: Prompt Setup
20:25 - RAG: Full Multimodal Retrieval Augmented Generation
21:15 - Outro

#ai #coding #generativeai
Published one month ago, on 1403/04/25 (Solar Hijri calendar).
Viewed 1,723 times.