In a Colab Notebook we code a visualization of the last layer of the Vision Transformer (ViT) encoder stack and analyze the visual output of each of the 12 attention heads for a specific image. This shows why a ViT that is only pre-trained (even with the DINO method) cannot always succeed on an image classification (downstream) task: the fine-tuning of the ViT is simply missing, and it is essential for better performance.
Based on the Colab notebook by Niels Rogge, HuggingFace (all rights with him):
https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/DINO/Visualize_self_attention_of_DINO.ipynb
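For reference, here is a minimal sketch of the kind of visualization the notebook builds, assuming a DINO ViT-Base/16 checkpoint (`facebook/dino-vitb16`, which has 12 heads per layer) and the HuggingFace `transformers` API; the notebook itself may differ in model size and plotting details:

```python
import torch
import requests
import matplotlib.pyplot as plt
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

# Any RGB image works; this COCO sample is just an example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Assumption: a Base-size DINO checkpoint (12 encoder layers, 12 heads).
model_name = "facebook/dino-vitb16"
processor = ViTImageProcessor.from_pretrained(model_name)
model = ViTModel.from_pretrained(model_name, add_pooling_layer=False)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions is a tuple with one tensor per encoder layer,
# each of shape (batch, num_heads, seq_len, seq_len).
last_layer_attn = outputs.attentions[-1]
num_heads = last_layer_attn.shape[1]  # 12 for ViT-Base

# Attention of the [CLS] token to the patch tokens, per head.
# Token 0 is [CLS]; the remaining 196 tokens form a 14x14 grid (224/16).
cls_attn = last_layer_attn[0, :, 0, 1:]
grid = int(cls_attn.shape[-1] ** 0.5)
cls_attn = cls_attn.reshape(num_heads, grid, grid).numpy()

# One subplot per attention head.
fig, axes = plt.subplots(3, 4, figsize=(12, 9))
for head, ax in enumerate(axes.flat):
    ax.imshow(cls_attn[head], cmap="viridis")
    ax.set_title(f"Head {head}")
    ax.axis("off")
plt.tight_layout()
plt.show()
```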
In one of my next videos we will code the fine-tuning of a pre-trained Vision Transformer (ViT) from scratch, for better image classification performance.
#ai
#vision
#technology