【译文】
我们从零开始复现 GPT-2(124M)。这段视频涵盖了整个过程:首先构建 GPT-2 网络,然后优化其训练使之运行得非常快,接着按照 GPT-2 和 GPT-3 论文及其超参数来设置训练运行,然后启动运行,第二天早上回来查看结果,并欣赏一些有趣的模型生成样本。请注意,本视频在某些地方建立在 Zero to Hero 播放列表(见我的频道)中早期视频的知识之上。你也可以把这个视频看作是在搭建我的 nanoGPT 仓库,到最后两者大约有 90% 的相似度。
链接:
build-nanogpt GitHub 仓库,本视频中的所有改动都以单独的提交形式给出:https://github.com/karpathy/build-nan...
nanoGPT 仓库:https://github.com/karpathy/nanoGPT
llm.c 仓库:https://github.com/karpathy/llm.c
我的网站:https://karpathy.ai
我的推特:/ karpathy
我们的 Discord 频道:/ discord
补充链接:
Attention Is All You Need 论文:https://arxiv.org/abs/1706.03762
OpenAI GPT-3 论文:https://arxiv.org/abs/2005.14165
OpenAI GPT-2 论文:https://d4mucfpksywv.cloudfront.net/b...
我用来训练模型的 GPU 来自 Lambda GPU Cloud,我认为这是在云端启动一个可以 ssh 登录的按需 GPU 实例的最好、最简单的方式:https://lambdalabs.com
章节:
00:00:00 简介:让我们复现 GPT-2(124M)
00:03:39 探索 GPT-2(124M)的 OpenAI 检查点
00:13:47 第一节:实现 GPT-2 的 nn.Module
00:28:08 加载 huggingface/GPT-2 参数
00:31:00 实现前向传播,得到 logits
00:33:31 采样初始化、前缀 token、分词
00:37:02 采样循环
00:41:47 采样,自动检测设备
00:45:50 开始训练:数据批次 (B,T) → logits (B,T,C)(见本章节列表后的代码示意)
00:52:53 交叉熵损失
00:56:42 优化循环:在单个批次上过拟合
01:02:00 简易数据加载器(data loader lite)
01:06:14 参数共享:wte 与 lm_head
01:13:47 模型初始化:std 0.02,残差初始化
01:22:18 第二节:让我们提速。GPU、混合精度,1000 毫秒
01:28:14 Tensor Core、给代码计时、TF32 精度,333 毫秒
01:39:38 float16、梯度缩放器、bfloat16,300 毫秒
01:48:15 torch.compile、Python 开销、内核融合,130 毫秒
02:00:18 flash attention,96 毫秒
02:06:54 漂亮/难看的数字。词表大小 50257 → 50304,93 毫秒
02:14:55 第三节:超参数、AdamW、梯度裁剪
02:21:06 学习率调度器:预热 + 余弦衰减
02:26:21 批量大小调度、权重衰减、FusedAdamW,90 毫秒
02:34:09 梯度累积
02:46:52 分布式数据并行(DDP)
03:10:21 GPT-2、GPT-3、FineWeb(EDU)中使用的数据集
03:23:10 验证集划分、验证损失、恢复采样代码
03:28:23 评测:HellaSwag,启动训练运行
03:43:05 第四节:早上出结果!GPT-2、GPT-3 复现
03:56:21 致敬 llm.c:用原生 C/CUDA 实现的等价但更快的代码
03:59:39 总结,呼,build-nanogpt GitHub 仓库
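
下面是一个极简的代码示意(并非视频中的完整实现;这里用一个嵌入表加线性层的玩具模型代替真正的 GPT,数据也是随机生成的,仅作演示假设),串起上面几个章节提到的要点:把 token 序列整理成 (B, T) 的批次、前向得到 (B, T, C) 的 logits、计算交叉熵损失,以及 wte 与 lm_head 的权重共享和 std 0.02 初始化:

import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, V, C = 4, 32, 50257, 768              # 批大小、上下文长度、词表大小、嵌入维度
tokens = torch.randint(0, V, (B * T + 1,))  # 用随机 token 代替真实训练数据(仅为演示)

x = tokens[:-1].view(B, T)                  # 输入批次 (B, T)
y = tokens[1:].view(B, T)                   # 目标:每个位置对应的下一个 token

wte = nn.Embedding(V, C)                    # token 嵌入表
nn.init.normal_(wte.weight, mean=0.0, std=0.02)  # std 0.02 初始化(对应 01:13:47 章节)
lm_head = nn.Linear(C, V, bias=False)       # 输出投影层
lm_head.weight = wte.weight                 # wte 与 lm_head 权重共享(对应 01:06:14 章节)

logits = lm_head(wte(x))                    # 形状为 (B, T, V) 的 logits
loss = F.cross_entropy(logits.view(-1, V), y.view(-1))
print(loss.item())                          # 未训练时应接近 -ln(1/50257) ≈ 10.8

真正的实现还包括完整的 Transformer 块、采样循环和优化循环等,请以 build-nanogpt 仓库中的代码为准。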
更正:
我将在build-nanogpt GitHub仓库(上方链接)中发布所有勘误和后续更新。
超级感谢:
我昨天刚在频道里实验性地开启了这一功能。完全可选,只有在手头宽裕时才建议使用。所有收入都将用于支持我在 AI + 教育方面的工作。
【原文】
We reproduce the GPT-2 (124M) from scratch. This video covers the whole process: First we build the GPT-2 network, then we optimize its training to be really fast, then we set up the training run following the GPT-2 and GPT-3 paper and their hyperparameters, then we hit run, and come back the next morning to see our results, and enjoy some amusing model generations. Keep in mind that in some places this video builds on the knowledge from earlier videos in the Zero to Hero Playlist (see my channel). You could also see this video as building my nanoGPT repo, which by the end is about 90% similar.
Links:
build-nanogpt GitHub repo, with all the changes in this video as individual commits: https://github.com/karpathy/build-nan...
nanoGPT repo: https://github.com/karpathy/nanoGPT
llm.c repo: https://github.com/karpathy/llm.c
my website: https://karpathy.ai
my twitter: / karpathy
our Discord channel: / discord
Supplementary links:
Attention is All You Need paper: https://arxiv.org/abs/1706.03762
OpenAI GPT-3 paper: https://arxiv.org/abs/2005.14165
OpenAI GPT-2 paper: https://d4mucfpksywv.cloudfront.net/b...
The GPU I'm training the model on is from Lambda GPU Cloud, I think the best and easiest way to spin up an on-demand GPU instance in the cloud that you can ssh to: https://lambdalabs.com
Chapters:
00:00:00 intro: Let’s reproduce GPT-2 (124M)
00:03:39 exploring the GPT-2 (124M) OpenAI checkpoint
00:13:47 SECTION 1: implementing the GPT-2 nn.Module
00:28:08 loading the huggingface/GPT-2 parameters
00:31:00 implementing the forward pass to get logits
00:33:31 sampling init, prefix tokens, tokenization
00:37:02 sampling loop
00:41:47 sample, auto-detect the device
00:45:50 let’s train: data batches (B,T) → logits (B,T,C)
00:52:53 cross entropy loss
00:56:42 optimization loop: overfit a single batch
01:02:00 data loader lite
01:06:14 parameter sharing wte and lm_head
01:13:47 model initialization: std 0.02, residual init
01:22:18 SECTION 2: Let’s make it fast. GPUs, mixed precision, 1000ms
01:28:14 Tensor Cores, timing the code, TF32 precision, 333ms
01:39:38 float16, gradient scalers, bfloat16, 300ms
01:48:15 torch.compile, Python overhead, kernel fusion, 130ms
02:00:18 flash attention, 96ms
02:06:54 nice/ugly numbers. vocab size 50257 → 50304, 93ms
02:14:55 SECTION 3: hyperparameters, AdamW, gradient clipping (see the sketch after this chapter list)
02:21:06 learning rate scheduler: warmup + cosine decay
02:26:21 batch size schedule, weight decay, FusedAdamW, 90ms
02:34:09 gradient accumulation
02:46:52 distributed data parallel (DDP)
03:10:21 datasets used in GPT-2, GPT-3, FineWeb (EDU)
03:23:10 validation data split, validation loss, sampling revive
03:28:23 evaluation: HellaSwag, starting the run
03:43:05 SECTION 4: results in the morning! GPT-2, GPT-3 repro
03:56:21 shoutout to llm.c, equivalent but faster code in raw C/CUDA
03:59:39 summary, phew, build-nanogpt github repo
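
As a rough, self-contained sketch (not the code from the video; the toy embedding-plus-linear model and the random token batches are stand-ins purely for illustration, and PyTorch 2.x is assumed for torch.compile and fused AdamW), the main Section 2 and 3 ingredients from the chapter list above, TF32 matmuls, bfloat16 autocast, torch.compile, fused AdamW, gradient clipping, and a warmup + cosine learning-rate schedule, fit together roughly like this:

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
# toy stand-in for the GPT module built in Section 1 (vocab padded to the "nice" 50304)
model = nn.Sequential(nn.Embedding(50304, 768), nn.Linear(768, 50304, bias=False)).to(device)

torch.set_float32_matmul_precision("high")   # allow TF32 matmuls on Ampere+ GPUs (01:28:14)
model = torch.compile(model)                 # kernel fusion, less Python overhead (01:48:15)

optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, betas=(0.9, 0.95),
                              weight_decay=0.1, fused=(device == "cuda"))

max_lr, min_lr, warmup_steps, max_steps = 6e-4, 6e-5, 10, 50

def get_lr(step):
    # linear warmup, then cosine decay down to min_lr (02:21:06)
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * ratio))

B, T = 4, 64
for step in range(max_steps):
    x = torch.randint(0, 50304, (B, T), device=device)  # random tokens as placeholder data
    y = torch.randint(0, 50304, (B, T), device=device)
    optimizer.zero_grad()
    # bfloat16 autocast for the forward pass; unlike float16, no gradient scaler is needed (01:39:38)
    with torch.autocast(device_type=device, dtype=torch.bfloat16, enabled=(device == "cuda")):
        logits = model(x)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # global-norm gradient clipping (02:14:55)
    for group in optimizer.param_groups:
        group["lr"] = get_lr(step)                           # apply the schedule
    optimizer.step()

The actual run in the video additionally separates weight-decay and no-decay parameter groups, uses gradient accumulation to reach the large token batch size, and wraps the model in DistributedDataParallel across GPUs; those pieces are omitted here to keep the sketch short, so refer to the build-nanogpt repo for the real thing.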
Corrections:
I will post all errata and follow-ups to the build-nanogpt GitHub repo (link above).
SuperThanks:
I experimentally enabled them on my channel yesterday. Totally optional and only use if rich. All revenue goes to supporting my work in AI + Education.