If LLMs are text models, how do they generate images? (Transformers + VQVAE explained)

In this video, I talk about multimodal LLMs, Vector-Quantized Variational Autoencoders (VQ-VAEs), and how modern models like Google's Gemini and Parti, and OpenAI's DALL-E, generate images together with text. I cover everything from the very basics (latent spaces, autoencoders) all the way to more advanced topics (VQ-VAEs, codebook embeddings, etc.).
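As a quick taste of the codebook idea covered in the video, here is a minimal toy sketch (my own illustration, not the video's code; all names, shapes, and sizes are made-up assumptions) of how a VQ-VAE snaps an encoder's continuous output to its nearest codebook embeddings, turning an image into a grid of discrete tokens that a transformer can then predict just like text:

# Toy sketch of VQ-VAE codebook quantization (illustrative only).
import torch

codebook = torch.randn(512, 64)          # 512 learnable codebook embeddings, each 64-dim (assumed sizes)
encoder_output = torch.randn(8, 8, 64)   # e.g. an 8x8 grid of latent vectors for one image

flat = encoder_output.reshape(-1, 64)        # flatten the grid to (64, 64)
dists = torch.cdist(flat, codebook)          # distance from each latent to every codebook entry
token_ids = dists.argmin(dim=1)              # nearest-neighbour index = discrete "image token"
quantized = codebook[token_ids].reshape(8, 8, 64)  # what the decoder actually receives

print(token_ids.reshape(8, 8))  # the image as an 8x8 grid of discrete visual tokens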

Follow on Twitter: @neural_avb
#ai #deeplearning #machinelearning

To support the channel and access the documents, slides, and animations used in this video, consider JOINING the channel on YouTube or Patreon. Members get access to code, project files, scripts, slides, animations, and illustrations for most of the videos on my channel! Learn more about the perks below.

Join and support the channel - https://www.youtube.com/@avb_fj/join
Patreon - patreon.com/neuralbreakdownwithavb

Interesting videos/playlists:
Multimodal Deep Learning - • Multimodal AI from First Principles -...
Variational Autoencoders and Latent Space - • Visualizing the Latent Space: This vi...
From Neural Attention to Transformers - • Attention to Transformers from First ...

Papers to read:
VAE - https://arxiv.org/abs/1312.6114
VQ-VAE - https://arxiv.org/abs/1711.00937
VQ-GAN - https://compvis.github.io/taming-tran...
Gemini - https://assets.bwbx.io/documents/user...
Parti - https://sites.research.google/parti/
DALL-E - https://arxiv.org/pdf/2102.12092.pdf

Timestamps:
0:00 - Intro
3:49 - Autoencoders
6:16 - Latent Spaces
9:50 - VQ-VAE
11:30 - Codebook Embeddings
14:40 - Multimodal LLMs generating images
