Normalizing GPT on the unit-hypersphere (WITH CODE)


To try everything Brilliant has to offer—free—for a full 30 days, visit https://brilliant.org/Tunadorable . You’ll also get 20% off an annual premium subscription.

My Twitter, LinkedIn, Discord, Patreon, consultation booking page, etc:
https://linktr.ee/tunadorable

nGPT: Normalized Transformer with Representation Learning on the Hypersphere
ArXiv: https://arxiv.org/abs/2410.01131v1
Bytez: https://bytez.com/docs/arxiv/2410.011...
AlphaXiv: https://alphaxiv.org/abs/2410.01131v1

Nvidia's official implementation:
https://github.com/NVIDIA/ngpt
My code, most of which I wrote before realizing Nvidia had open-sourced theirs:
https://github.com/evintunador/nGPT
Other people's replications:
https://github.com/lucidrains/nGPT-py...

This video was sponsored by Brilliant

An in-depth breakdown of how LayerNorm works:
• Geometry and Dynamics of LayerNorm
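
If it helps to see the two normalizations side by side before watching: a minimal PyTorch sketch (my own illustration, not taken from either repo above) of standard LayerNorm versus the cosine normalization that nGPT swaps in.

```python
import torch

def layer_norm(x, gamma, beta, eps=1e-5):
    # standard LayerNorm: center each token, rescale to unit variance, then apply a learned affine map
    mu = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return gamma * (x - mu) / torch.sqrt(var + eps) + beta

def cosine_norm(x, eps=1e-8):
    # what nGPT uses instead: divide by the L2 norm, projecting each token onto the unit hypersphere
    return x / x.norm(p=2, dim=-1, keepdim=True).clamp(min=eps)
```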

Timestamps:
00:00 intro
01:29 sponsor
02:47 optimizing parameters on the hypersphere
14:05 transformers as variable-metric optimizers
17:56 results sneak peek
18:15 embedding & unembedding
22:57 when to cosine normalize
24:22 scaling output logits
25:35 residual connections (sketched in code below)
31:22 NO post-norm scaling
31:47 attention
36:20 MLP
39:52 scaling specific parameters by floats
43:30 no weight decay or lr warmup
44:30 experiments
52:55 more on scale parameters
54:50 similarity to QK normalization
56:20 decoupling eigen learning rates from Attn/MLP
58:14 taking the absolute value of eigen learning rates
58:50 model parameters
59:00 time cost per step
1:00:00 parameter initializations
1:01:40 learning rate
1:02:04 rest of my code
1:04:56 outro
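
For anyone skimming before watching, here is a minimal PyTorch sketch of the normalized-residual / eigen-learning-rate idea that several of the timestamps above cover (roughly 25:35 through 58:14). It is my own paraphrase of the paper, not Nvidia's implementation; `attn`, `mlp`, and the alpha parameters are placeholders.

```python
import torch
import torch.nn.functional as F

def cosine_norm(x, dim=-1):
    # keep activations on the unit hypersphere (L2-normalize along the embedding dim)
    return F.normalize(x, p=2, dim=dim)

def ngpt_block_step(h, attn, mlp, alpha_attn, alpha_mlp):
    """One transformer block, nGPT-style: no LayerNorm and no plain residual add.

    h           (batch, seq, d_model) hidden states, assumed already unit-norm per token
    attn, mlp   the usual sublayers (placeholders here)
    alpha_*     learnable per-dimension step sizes, the paper's "eigen learning rates"
    """
    h_attn = cosine_norm(attn(h))                         # sublayer output, put back on the sphere
    h = cosine_norm(h + alpha_attn.abs() * (h_attn - h))  # step from h toward h_attn, then re-normalize
    h_mlp = cosine_norm(mlp(h))
    h = cosine_norm(h + alpha_mlp.abs() * (h_mlp - h))
    return h
```

As I understand it, the `.abs()` on the alphas is the 58:14 detail: keeping the step sizes non-negative so each update moves toward the sublayer output rather than away from it.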
