Ring Attention enables context lengths of 1 million tokens for our latest LLMs and VLMs. How is this possible? What happens to the quadratic complexity of self-attention in sequence length?
In this video, I explain the Blockwise Parallel Transformer idea from UC Berkeley, all the way to the actual code implementation on GitHub for Ring Attention with Blockwise Transformers.
The current Google Gemini 1.5 Pro has a context length of 1 million tokens on Vertex AI.
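To make the complexity question concrete, here is a minimal blockwise-attention sketch in JAX (not the authors' code): keys and values are processed block by block with a numerically stable online softmax, so the full N x N attention matrix is never materialized. The block size and tensor shapes are illustrative assumptions.

```python
# Minimal sketch of blockwise attention with an online softmax (assumed shapes).
import jax
import jax.numpy as jnp

def blockwise_attention(q, k, v, block_size=128):
    """q, k, v: [seq_len, head_dim]; returns [seq_len, head_dim]."""
    seq_len, dim = q.shape
    scale = 1.0 / jnp.sqrt(dim)

    out = jnp.zeros_like(q)                      # unnormalized weighted sum of values
    row_max = jnp.full((seq_len, 1), -jnp.inf)   # running max of scores per query
    row_sum = jnp.zeros((seq_len, 1))            # running softmax denominator

    for start in range(0, seq_len, block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]

        scores = (q @ k_blk.T) * scale           # [seq_len, block_size], never [seq_len, seq_len]
        blk_max = scores.max(axis=-1, keepdims=True)
        new_max = jnp.maximum(row_max, blk_max)

        corr = jnp.exp(row_max - new_max)        # rescale old accumulators to the new max
        p = jnp.exp(scores - new_max)
        out = out * corr + p @ v_blk
        row_sum = row_sum * corr + p.sum(axis=-1, keepdims=True)
        row_max = new_max

    return out / row_sum
```

The result matches jax.nn.softmax(q @ k.T / sqrt(d)) @ v, but the intermediate tensors scale with the block size instead of the full sequence length.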
00:00 3 ways for infinite context lengths
02:05 Blockwise Parallel Transformers
03:11 Q, K, V explained in a library
06:10 BPT explained in a library
11:30 Maths for blockwise parallel transformers
12:41 Ring attention symmetries
14:25 Ring attention explained
19:52 Ring attention JAX code
23:59 Outlook: Infini Attention by Google
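The "Ring attention JAX code" chapter walks through the repository; as a rough orientation, here is a minimal sketch (not the repository's implementation) of the ring pattern it builds on: each device keeps its query block fixed and rotates its key/value block to the neighbouring device with jax.lax.ppermute, accumulating blockwise attention until every block has visited every device. The axis name 'ring', the axis_size argument, and the shard shapes are assumptions for illustration.

```python
# Minimal per-device sketch of the ring rotation pattern (assumed axis name and size).
import jax
import jax.numpy as jnp

def ring_attention_shard(q_blk, k_blk, v_blk, axis_name='ring', axis_size=8):
    """Per-device shard; intended to run inside jax.pmap(..., axis_name='ring')."""
    scale = 1.0 / jnp.sqrt(q_blk.shape[-1])
    out = jnp.zeros_like(q_blk)
    row_max = jnp.full((q_blk.shape[0], 1), -jnp.inf)
    row_sum = jnp.zeros((q_blk.shape[0], 1))
    # Each device sends its key/value shard to its right-hand neighbour.
    perm = [(j, (j + 1) % axis_size) for j in range(axis_size)]

    for _ in range(axis_size):  # after axis_size hops, every K/V block has visited every device
        scores = (q_blk @ k_blk.T) * scale
        blk_max = scores.max(axis=-1, keepdims=True)
        new_max = jnp.maximum(row_max, blk_max)
        corr = jnp.exp(row_max - new_max)
        p = jnp.exp(scores - new_max)
        out = out * corr + p @ v_blk
        row_sum = row_sum * corr + p.sum(axis=-1, keepdims=True)
        row_max = new_max
        # Rotate key/value shards one hop around the device ring.
        k_blk = jax.lax.ppermute(k_blk, axis_name, perm)
        v_blk = jax.lax.ppermute(v_blk, axis_name, perm)

    return out / row_sum
```

In the paper, the transfer of the next key/value block is overlapped with the attention computation on the current one, which is how the communication cost is hidden.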
All rights w/ authors:
Ring Attention with Blockwise Transformers for Near-Infinite Context
https://arxiv.org/pdf/2310.01889.pdf
#airesearch
#ai