Vision Transformers (ViT) Explained + Fine-tuning in Python

Vision and language are the two big domains in machine learning. Two distinct disciplines with their own problems, best practices, and model architectures. At least, that was the case.

The Vision Transformer (ViT) marks the first step towards the merger of these two fields into a single unified discipline. For the first time in the history of ML, a single model architecture has come to dominate both language and vision.

Before ViT, transformers were "those language models" and nothing more. Since then, ViT and follow-up work have solidified the transformer as a likely contender for the architecture that merges the two disciplines.

This video dives into ViT, explaining and visualizing the intuition behind how and why it works. We will then see how to implement it with the Hugging Face transformers library in Python and use it for image classification.
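As a quick taste of what the implementation looks like, here is a minimal sketch (not the exact notebook from the video) that loads a pretrained ViT from the Hugging Face transformers library and classifies a single image. The "google/vit-base-patch16-224" checkpoint and the example image URL are assumptions for illustration; the video fine-tunes its own model on a different dataset.

```python
# Minimal sketch: single-image classification with a pretrained ViT
# via Hugging Face transformers (checkpoint choice is illustrative).
from PIL import Image
import requests
import torch
from transformers import ViTImageProcessor, ViTForImageClassification

# Any RGB image works; this COCO image is just an example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

# The processor resizes and normalizes the image into pixel_values for the model.
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

predicted_class = logits.argmax(-1).item()
print(model.config.id2label[predicted_class])
```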

🌲 Pinecone article:
https://www.pinecone.io/learn/vision-...

Code:
https://github.com/pinecone-io/exampl...

🤖 AI Dev Studio:
https://aurelio.ai

👾 Discord:
  / discord  

00:00 Intro
00:58 In this video
01:12 What are transformers and attention?
01:39 Attention explained simply
04:15 Attention used in CNNs
05:24 Transformers and attention
07:01 What vision transformer (ViT) does differently
07:28 Images to patch embeddings
08:22 1. Building image patches
10:23 2. Linear projection
10:57 3. Learnable class embedding
13:30 4. Adding positional embeddings
16:37 ViT implementation in Python with Hugging Face
16:45 Packages, dataset, and Colab GPU
18:42 Initialize Hugging Face ViT Feature Extractor
22:48 Hugging Face Trainer setup
25:14 Training and CUDA device error
26:27 Evaluation and classification predictions with ViT
28:54 Final thoughts
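
The chapter list above walks through four steps for turning an image into patch embeddings (image patches, linear projection, learnable class embedding, positional embeddings). Below is a rough plain-PyTorch sketch of those steps for intuition only; the shapes follow the ViT-Base/16 configuration (16x16 patches, 768-dim embeddings), and the actual implementation used in the video comes from the Hugging Face transformers library.

```python
# Rough sketch of steps 1-4 from the chapter list (ViT-Base/16 shapes assumed).
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)           # (batch, channels, height, width)
patch_size, embed_dim = 16, 768
num_patches = (224 // patch_size) ** 2        # 14 * 14 = 196 patches

# 1. Build image patches + 2. linear projection, done together with a strided
#    convolution (equivalent to flattening each 16x16x3 patch and applying a
#    learned projection matrix).
to_patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
patches = to_patch_embed(image)               # (1, 768, 14, 14)
patches = patches.flatten(2).transpose(1, 2)  # (1, 196, 768)

# 3. Prepend a learnable [class] embedding whose final state is used for classification.
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
tokens = torch.cat([cls_token.expand(1, -1, -1), patches], dim=1)  # (1, 197, 768)

# 4. Add learnable positional embeddings so the transformer knows where each patch sits.
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
tokens = tokens + pos_embed                   # ready for the transformer encoder
```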

#machinelearning #deeplearning #ai #python
