Vision and language Transformers, bridged by a Q-Former (Querying Transformer): BLIP-2!
The financial resources needed to pre-train both systems (vision and language) from scratch are astronomical. Let me introduce you to a clever new training method: BLIP-2. It connects a frozen image encoder to a frozen Large Language Model, enabling multimodal LLMs for visual question answering, image captioning, multimodal dialogue, and image recognition with verbal content descriptions, plus a chat function.
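The core idea of the Q-Former can be illustrated with a minimal NumPy sketch (illustrative weights and toy hidden size, not the paper's full architecture): a small, fixed set of learned query vectors cross-attends to the frozen image encoder's patch features, producing a fixed-length visual summary that can then be projected into the frozen LLM's input space. BLIP-2 uses 32 learned queries; everything else here (dimensions, random weights) is a placeholder.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, features, w_q, w_k, w_v):
    # queries: (n_queries, d) learned, trainable
    # features: (n_patches, d) from the FROZEN image encoder
    q = queries @ w_q
    k = features @ w_k
    v = features @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product attention
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 16            # toy hidden size for the sketch
n_queries = 32    # BLIP-2 uses 32 learned queries
n_patches = 257   # e.g. ViT patch tokens from the frozen encoder

learned_queries = rng.normal(size=(n_queries, d))   # the only trainable part
image_features = rng.normal(size=(n_patches, d))    # frozen encoder output
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

out = cross_attention(learned_queries, image_features, w_q, w_k, w_v)
print(out.shape)  # (32, 16): fixed-size summary, regardless of image size
```

Because the output always has exactly `n_queries` rows, the LLM sees a constant-length visual prefix no matter how many patches the image encoder emits; only the queries (and the Q-Former around them) are trained, which is what keeps pre-training cheap.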
All rights and credits to:
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
https://arxiv.org/abs/2301.12597
#ai
#machinelearning
#chatgpt
#vision
#llm
#BLIP2
#QFormer